Scientific Reports. 2025 Oct 16;15:36223. doi: 10.1038/s41598-025-20226-z

Efficient fusion transformer model for accurate classification of eye diseases

Ankang Lin
PMCID: PMC12533095  PMID: 41102308

Abstract

Deep learning based automatic diagnosis models for medical images can improve diagnostic efficiency and reduce diagnostic costs. At present, there is a lack of research on dedicated artificial intelligence models for analyzing the characteristics of fundus diseases in medical images. Considering that fundus diseases have both local and global features, this paper proposes a novel deep learning model, the Local-Global Scale Fusion Network (LGSF-Net). The novelty lies in a dual-stream fusion design that processes global context (Transformer) and local details (CNN) in parallel with residual fusion. On the public fundus dataset, LGSF-Net delivers 96% accuracy with only 18.7K parameters and 0.93 GFLOPs, outperforming existing state-of-the-art universal methods like ResNet50 and ViT. LGSF-Net is more suitable for clinical diagnosis because of its accuracy and lightweight design. The ablation study shows that LGSF-Net's concept of multi-scale fusion understanding is correctly realized. This work effectively promotes the development of smart medicine and provides a new solution for the design of new deep learning models.

Keywords: Local-Global Scale Fusion Network, Convolutional neural networks, Transformers, Fundus images, Medical image analysis.

Subject terms: Computer science, Software

Introduction

Millions of people worldwide suffer from various eye conditions, which affect different components of the visual system. Some eye conditions can cause severe or even irreversible vision impairment. In particular, the rising prevalence of cataracts1,2, diabetic retinopathy3, and glaucoma4 significantly contributes to global vision impairment, forming the focus of this research. Cataract is a visual impairment caused by clouding of the eye’s crystalline lens. Early identification is crucial for cataract treatment5, and accurate classification in the early stages is invaluable for improving treatment outcomes. Another common eye disease that causes vision loss is diabetic retinopathy (DR). DR is a retinal microvascular disease that can cause severe vision impairment or blindness in the diabetic population6. Symptoms of DR, such as blurred vision, difficulties with color perception, and eyeball floaters, can be so subtle that a comprehensive ocular examination is required to detect them7. Moreover, glaucoma is a blinding ophthalmic disease characterized by optic nerve damage and defects in retinal ganglion cells, often associated with elevated intraocular pressure. Factors contributing to glaucoma include age-related oxidative stress and initial optic nerve damage8. These challenges underscore the difficulty of its classification in clinical practice.

Recent studies have highlighted imaging modalities as effective tools in medical imaging, especially for diagnosing eye conditions and diseases. For instance, Optical Coherence Tomography (OCT) provides insights into the retinal inner structure, facilitating the diagnosis of eye diseases such as cataracts and glaucoma. While OCT is prevalent in North America, hospitals worldwide, including those in the Asia-Pacific region, are increasingly adopting this technology to aid in diagnosing and treating eye diseases9. Other imaging techniques, including Magnetic Resonance Imaging (MRI), X-ray, and digital mammography, also contribute significantly to real-world diagnostic practices10. However, despite the improvements brought by traditional imaging technologies, the high costs of training qualified ophthalmologists and radiologists result in a lack of these resources, preventing many, especially in underdeveloped areas, from accessing accurate diagnoses at early stages. To address this issue, deep learning (DL) has been extensively deployed to assist in diagnosis by analyzing medical examination results11, such as fundus images, using state-of-the-art models like Deep Neural Networks (DNNs)12, Convolutional Neural Networks (CNNs), Vision Transformers13, and Generative Adversarial Networks (GANs)14. Deep learning not only helps mitigate resource shortages by enabling doctors to save interpretation time15, but also allows patients with limited access to medical resources to receive more accurate diagnoses.

However, some challenges remain for the application of deep learning in disease classification. Firstly, the lack of generalization, low classification accuracy, and computational inefficiency hinder deep learning from being practical enough for widespread hospital use. While some novel models achieve over 95% Area Under the Receiver Operating Characteristic curve (AUC)16–18, few demonstrate comparable performance on new medical imaging modalities15. Moreover, data scarcity and the complexity of medical datasets pose significant challenges to further improving deep learning models for medical classification. Current models are typically trained on datasets much smaller than the practical scale, resulting in suboptimal performance. Furthermore, professional medical images contain complex nuances that significantly affect diagnosis, and this complexity detracts from the classification accuracy of deep learning models. Another major challenge is that recent research focuses either on detection based on local details or on global features, with few models attempting to integrate both types of information to enhance their performance.

Each specific medical image has its own prior features. As shown in Fig. 1, fundus lesions have both global and local features. Therefore, a dedicated deep learning model still needs to be designed for this prior. To address the fact that existing models ignore multi-scale information fusion, this paper proposes an innovative deep learning model, the Local-Global Scale Fusion Network (LGSF-Net), for classifying eye diseases. This model fuses local features with global information to identify eye diseases more accurately under limited computing resources.

Fig. 1.

Example of a fundus image of cataract. Cataract is a common eye disease associated with aging, featuring global characteristics such as retinal darkening and halos, as well as local features such as partial lens opacity (the red box area) and vascular changes (the green box area).

The classification task for eye disease can be mathematically expressed as follows: given an input image $x$, the model computes the probability of each class $c$ using the softmax function:

$$P(y = c \mid x) = \frac{\exp(z_c)}{\sum_{c'=1}^{C} \exp(z_{c'})} \qquad (1)$$

where $z_c$ is the output logit of class $c$ from the model, and $C$ is the total number of classes (e.g., cataract, diabetic retinopathy, glaucoma, and normal).

The predicted class $\hat{y}$ is determined by:

$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} P(y = c \mid x) \qquad (2)$$

which corresponds to the class with the highest probability.

Finally, the classification result can be represented as a one-hot vector $\mathbf{y} \in \{0, 1\}^C$, where:

$$y_c = \begin{cases} 1, & c = \hat{y} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

This one-hot vector indicates the final classification output.
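
For concreteness, Eqs. (1)-(3) map directly onto a few tensor operations. The following is a minimal PyTorch sketch using random logits as a stand-in for a real classifier's output:

```python
import torch
import torch.nn.functional as F

C = 4  # total number of classes: cataract, diabetic retinopathy, glaucoma, normal

logits = torch.randn(8, C)                 # z_c for a batch of 8 images (stand-in values)
probs = F.softmax(logits, dim=1)           # Eq. (1): P(y = c | x)
y_hat = probs.argmax(dim=1)                # Eq. (2): predicted class with highest probability
one_hot = F.one_hot(y_hat, num_classes=C)  # Eq. (3): one-hot classification output
```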

With the development of deep learning, many models with excellent feature representation ability have been proposed. Comparisons with other state-of-the-art models are conducted to verify the accuracy and performance of LGSF-Net. In the experiments, InceptionV319, ResNet5020, Vision Transformer (ViT)21, Swin Transformer (Swin)22, and Vision Mamba (Vim)23 are trained under the same conditions as LGSF-Net. By comparing and analyzing metrics such as AUC, F-score, and the confusion matrix, the effectiveness of LGSF-Net is thoroughly evaluated. The main contributions of this paper are summarized as follows:

  • A deep learning model LGSF-Net is constructed to integrate global and local multi-scale understanding for eye disease classification.

  • Through a series of comparative experiments, the model’s effective generalization performance and accuracy have been verified, advancing the development of intelligent healthcare.

  • A novel solution is provided for integrating different conceptual frameworks in deep learning models, offering new directions for future research.

In the following sections, we first provide a brief review of related work, then present the model architecture in the methodology section, comparing its conceptual differences with mainstream generic models. The experiments section gives the implementation details of the model and the main results, and the ablation study verifies the effectiveness of the proposed concept. The paper concludes after discussing the extensibility of the method and the limitations of this work.

Related work

Traditional machine learning methods for medical image classification

In the traditional machine learning field, many supervised learning methods have been proposed for classifying medical images. Since 2015, Support Vector Machines (SVM), Random Forests (RF), and K-nearest Neighbors (KNN) have prevailed in the field of automatic medical detection24. These classification approaches predict the probability of diseases from features extracted from medical images through training on large amounts of data and pattern recognition. Owing to its computational speed, several researchers have applied SVM to recognizing targeted medical images. Vijayarajeswari et al.25 devised an SVM classifier with a two-dimensional transform for detecting breast cancer. Latif et al.26 developed a new glaucoma classification approach, EGWO-SVM, which processes extracted features more effectively. However, SVM’s main drawback lies in its inefficiency when training on large datasets, preventing further application given the volume of data required in clinical practice24. In response to this problem, other studies adopt the RF classifier, which is more effective at handling large amounts of data with nonlinear patterns in medical detection27–29. Although random forests perform well on huge datasets, the architecture is too complicated in some conditions for RF to achieve a satisfying processing rate. Another popular method is KNN, which relies on the similarity between the input and the training data. It is widely used to improve diagnostic reliability by recognizing major diseases such as glaucoma30, breast tumors31, and ankylosing spondylitis32.

Deep learning methods for medical image classification

CNN-based models have significantly advanced medical image classification by enabling the automatic learning of spatial hierarchies of features. ResNet20, a widely adopted architecture, introduced residual learning to address the vanishing gradient issue, allowing the creation of deeper and more effective networks. This approach has proven beneficial in various applications, such as analyzing retinal fundus images to detect diabetic retinopathy20,33. Similarly, EfficientNet, an architecture that optimally scales network dimensions through compound scaling, has demonstrated impressive accuracy in tasks like lung disease detection using chest X-rays34,35. The combination of these architectures with preprocessing techniques, such as image augmentation and transfer learning, has further enhanced their performance in applications such as brain tumor classification using MRI scans36. Despite these advancements, CNNs face challenges related to interpretability and computational efficiency, particularly when dealing with high-resolution medical images. This conclusion inspires us to fuse CNNs with other up-to-date model frameworks so that their advantages can be maintained while their disadvantages are alleviated.

One significant limitation in the current literature is the absence of models specifically designed for medical imaging tasks, which demand much higher accuracy across large-scale classification. While general-purpose architectures such as ResNet20 and EfficientNet have shown effectiveness, they often fail to address the unique characteristics of medical datasets. These datasets frequently exhibit small sample sizes and high intra-class variability, which require domain-specific modifications. Incorporating prior anatomical knowledge or task-specific constraints into the models could significantly improve their performance15. Therefore, a new model that targets the characteristics of medical images, such as information variety, should be considered in order to improve the accuracy of diagnosis.

Another under-explored area is the application of contrastive learning and state-space modeling techniques in medical imaging. Contrastive learning, which has shown promise in unsupervised learning for computer vision, could be highly beneficial in leveraging unannotated medical data to learn meaningful representations. This approach has the potential to improve classification performance while reducing the reliance on annotated data37. Similarly, state-space modeling, commonly used in time-series analysis, could provide new opportunities for analyzing structural and sequential dependencies in medical images. These methods could be particularly useful for volumetric imaging datasets or longitudinal studies, enabling better modeling of disease progression patterns38.

Local-global deep learning models

In recent years, multi-scale fusion understanding has been one of the important design concepts for improving deep learning models, aiming to consider both the global and local features of visual semantics simultaneously. Local-global deep learning models are widely used in general image restoration39 and removal of specific image noise40. This design concept is relatively simple to implement and can significantly improve the performance of the model, providing new improvement ideas for many computer vision tasks.

The medical images of many diseases also have the characteristics of multiple scales simultaneously. Therefore, some models for medical image analysis based on the concept of multi-scale fusion understanding have emerged41,42. These works do not involve the analysis of eye diseases. The prior knowledge about eye diseases also indicates that fundus images will show lesion characteristics of different scales. The overall changes of the lens and retina need to be combined with local changes such as vascular margins to accurately diagnose the disease.

Methodology

Model architecture overview

We propose LGSF-Net (Local-Global Scale Fusion Network), a novel dual-stream architecture that effectively combines convolutional neural networks (CNNs) and transformers for retinal image analysis. As illustrated in Fig. 2, our model employs a parallel processing strategy that leverages both the local feature extraction capabilities of CNNs and the global context modeling of transformers. By using CNN and transformer in parallel, the model is expected to capture local details, such as features of capillaries, and global information, such as the color and shape of the fundus, which is highly suitable for retinal disease diagnosis. Besides the key CNN and transformer modules, other established techniques, such as preprocessing, residual addition, and average pooling, are also adopted in the architecture.

Fig. 2.

This figure illustrates the comprehensive architecture of our proposed LGSF-Net model. (a) Data preprocessing pipeline: The fundus image dataset undergoes initial preprocessing, followed by stratified partitioning into training, validation, and test sets. The training set is duplicated to enable parallel processing through the dual-stream network. (b) Overall model architecture: The dual-stream design processes the cloned inputs ($X_1$, $X_2$) through complementary pathways. The first stream processes data sequentially through a ConvBlock followed by a transformer module, while the second stream reverses this order. The outputs from both streams are combined via element-wise addition to produce the final prediction. (c) ConvBlock structure: Each ConvBlock consists of three cascaded $3 \times 3$ convolutional layers with ReLU activation functions, maintaining consistent feature dimensionality ($2d$) throughout. (d) Transformer module architecture: The module implements a multi-head attention mechanism. The input features undergo parallel processing through Query (Q), Key (K), and Value (V) linear transformations, followed by scaled dot-product attention. The attention outputs are concatenated and processed through a feed-forward network with dropout regularization.

Combining the local feature learning ability of convolutional blocks with the global feature learning ability of attention, a PyTorch43-style implementation of LGSF-Net is shown in Algorithm 1. The core comprises convolutional blocks (detailed in the ConvBlock module section) and transformer blocks (detailed in the Transformer module section).

Algorithm 1.

PyTorch-style pseudocode for the LGSF-Net model.
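
Since Algorithm 1 appears as a figure, the following self-contained PyTorch sketch gives one concrete reading of the dual-stream design in Fig. 2; the stem downsampling, the exact residual placement, and the pooled linear head are assumptions, not the verbatim original listing:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three cascaded 3x3 convolutions with ReLU (Fig. 2c)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.body(x)

class TransBlock(nn.Module):
    """Multi-head self-attention over flattened spatial tokens (Fig. 2d, Table 1)."""
    def __init__(self, ch, heads=2, ff_dim=64, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(ch, ff_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(ff_dim, ch), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        t, _ = self.attn(t, t, t)
        t = self.ff(t)
        return t.transpose(1, 2).reshape(b, c, h, w)

class LGSFNet(nn.Module):
    def __init__(self, ch=16, num_classes=4):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, stride=4, padding=1)     # assumed embedding stem
        self.conv1, self.trans1 = ConvBlock(ch), TransBlock(ch)  # stream 1: local -> global
        self.trans2, self.conv2 = TransBlock(ch), ConvBlock(ch)  # stream 2: global -> local
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x):
        x = self.stem(x)
        s1 = self.trans1(self.conv1(x) + x)     # assumed residual addition within the stream
        s2 = self.conv2(self.trans2(x) + x)
        fused = s1 + s2                          # element-wise fusion of the two streams
        return self.head(fused.mean(dim=(2, 3)))  # average pooling + classifier

logits = LGSFNet()(torch.randn(1, 3, 64, 64))    # smoke test on a small input
```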

ConvBlock module

The ConvBlock, as shown in Fig. 2(c), consists of three cascaded $3 \times 3$ convolutional layers with ReLU activation. Each convolution operation can be expressed as

$$F(X) = \mathrm{ReLU}(W * X + b) \qquad (4)$$

where $W$ represents the convolutional kernel, $*$ denotes the convolution operation, and $b$ is the bias term. The complete ConvBlock operation can be written as:

$$\mathrm{ConvBlock}(X) = F_3(F_2(F_1(X))) \qquad (5)$$
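
Read literally, Eqs. (4) and (5) compose three convolution-plus-ReLU maps. A direct PyTorch transcription might look as follows (the channel width and the equal input/output dimensionality are assumptions):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Eq. (5): ConvBlock(X) = F3(F2(F1(X))), each F_i as in Eq. (4)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        # Each layer realizes F(X) = ReLU(W * X + b), Eq. (4)
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f3(self.relu(self.f2(self.relu(self.f1(x))))))
```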

Transformer module

The transformer module (Fig. 2(d)) implements a multi-head attention mechanism (two heads in our configuration; see Table 1). For an input feature map $X$, the attention operation44 is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (6)$$

where Q, K, and V are linear projections of the input:

$$Q = XW_Q \qquad (7)$$
$$K = XW_K \qquad (8)$$
$$V = XW_V \qquad (9)$$

The multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W_O \qquad (10)$$

where each head is computed independently using different learned projections. After that, a feed-forward block normalizes the results in preparation for the next stage. The feed-forward step is composed of two repeated combinations of linear, ReLU activation, and dropout layers. Each combination can be denoted as

$$\mathrm{FF}_i(X) = \mathrm{Dropout}(\mathrm{ReLU}(XW_i + b_i)) \qquad (11)$$

where $X$ is the output of the attention part. The whole feed-forward part can be expressed as

$$Y = \mathrm{FF}_2(\mathrm{FF}_1(X)) \qquad (12)$$

where $Y$ is the final output of the transformer block. The transformer block hyperparameters we selected are shown in Table 1.

Table 1.

Hyperparameter settings of the transformer model.

Hyperparameter Value
Tensor Channels 16
Transformer Depth 1
Attention Heads 2
Feedforward Dimension 64
Dropout Ratio for Training 0.1
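
Eqs. (6)-(12) together with Table 1 specify the transformer block almost completely. A sketch of a literal PyTorch transcription follows (the token layout and the absence of layer normalization are assumptions based on the text):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Table 1 settings: 16 channels, 2 heads, feed-forward dim 64, dropout 0.1."""
    def __init__(self, dim=16, heads=2, ff_dim=64, p=0.1):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.wq = nn.Linear(dim, dim)    # Eq. (7): Q = X W_Q
        self.wk = nn.Linear(dim, dim)    # Eq. (8): K = X W_K
        self.wv = nn.Linear(dim, dim)    # Eq. (9): V = X W_V
        self.wo = nn.Linear(dim, dim)    # Eq. (10): projection after Concat(head_1..head_h)
        self.ff = nn.Sequential(         # Eqs. (11)-(12): two (Linear, ReLU, Dropout) stacks
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(ff_dim, dim), nn.ReLU(), nn.Dropout(p))

    def forward(self, x):                # x: (batch, tokens, dim)
        b, n, d = x.shape
        split = lambda t: t.view(b, n, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # Eq. (6)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)  # concatenate heads
        return self.ff(self.wo(out))     # Eq. (12): Y = FF2(FF1(.))
```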

Concepts and state-of-the-art models for comparison

The first concept is dual-stream processing. The parallel processing streams enable simultaneous capture of local and global features. The first stream (ConvBlock → Transformer) emphasizes local feature extraction before global context modeling, while the second stream (Transformer → ConvBlock) prioritizes global dependencies before local refinement. The combination of a convolutional network and a transformer in this model yields improved performance in medical image classification.

For comparative evaluation, we benchmark LGSF-Net against state-of-the-art models including ResNet5020, ViT21, Swin Transformer22, and Vision Mamba23.

Experiments

Dataset

The dataset used in this study consists of publicly available medical images (see the data availability statement), including fundus images and optical coherence tomography (OCT) scans. The dataset is used to classify four categories of eye status: cataract, diabetic retinopathy, glaucoma, and normal. Figure 3 summarizes the class distribution of the dataset and shows the balanced representation of each category, ensuring unbiased model training and evaluation.

Fig. 3.

Detailed information of the dataset. Each category contains different resolutions, formats and aspect ratios.

Implementation details

The experiments were conducted with a controlled setup to ensure consistency and reliability of the evaluation, and the data were preprocessed with common methods. The detailed configuration is shown in Table 2. The deep learning framework PyTorch was used to implement our model. For the Adam optimizer, the hyperparameters $\beta_1$ and $\beta_2$ were fixed throughout training. For the training set, each random data augmentation is applied with a probability of 0.1. When the loss no longer decreases steadily, the learning rate is manually halved. When the loss on the validation set no longer decreases, training is terminated.

Table 2.

Experiment setup and Hyperparameter configurations.

Parameter Configuration
Optimizer Adam45
Initial Learning Rate 0.001
Batch Size 16
Number of Epochs 50
Weight Decay Regularization Inline graphic
Hardware Environment Two NVIDIA RTX 3080 GPUs (20GB each),
Intel Xeon Platinum 8352V CPU
Data Augmentation Techniques Random rotation (Inline graphic), horizontal and vertical
flipping, random cropping (90%), Gaussian noise
Image Resolution Inline graphic pixels
Dataset Split Ratio 80% training, 10% validation, 10% testing
Loss Function Cross Entropy Loss
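
Putting Table 2 together, a minimal training-loop sketch might look as follows. The model class, `train_loader`, and `val_loss` helper are assumed names; the paper halves the learning rate manually when the loss plateaus, which ReduceLROnPlateau approximates here, and the patience values are illustrative:

```python
import torch

model = LGSFNet()  # hypothetical: the model sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Table 2: Adam, LR 0.001
criterion = torch.nn.CrossEntropyLoss()                     # Table 2: cross-entropy loss
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

best, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):                                     # Table 2: 50 epochs
    model.train()
    for images, labels in train_loader:                     # batch size 16 (Table 2)
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    v = val_loss(model)                                     # assumed validation helper
    scheduler.step(v)                                       # halve LR when loss plateaus
    best, bad_epochs = (v, 0) if v < best else (best, bad_epochs + 1)
    if bad_epochs >= patience:                              # stop when val loss stalls
        break
```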

Metrics

To comprehensively evaluate the performance of the proposed model, the following metrics were employed:

  • Accuracy: The overall classification accuracy is calculated as:
    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Precision: Precision measures the proportion of true positive predictions among all positive predictions:
    $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
  • Recall (Sensitivity): Recall evaluates the proportion of actual positives correctly identified:
    $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
  • Specificity: Specificity evaluates the proportion of actual negatives correctly identified:
    $$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
  • F1-Score: The F1-Score represents the harmonic mean of precision and recall:
    $$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
  • Area Under the Curve (AUC): AUC quantifies the ability of the model to distinguish between classes. It is derived from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings.

  • Receiver Operating Characteristic (ROC): The ROC curve visualizes the trade-off between sensitivity and specificity across different thresholds, providing a graphical representation of model performance.

  • Confidence Interval (CI): To quantify the reliability of the evaluation metrics, a 95% confidence interval is computed for each metric. The confidence interval for the accuracy is calculated using the following formula:
    $$\mathrm{CI} = \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
    where $\hat{p}$ is the observed accuracy, $z$ is the critical value for a 95% confidence level ($z = 1.96$), and $n$ is the total number of samples. Similar calculations are applied to precision, recall, and F1-Score, providing statistical insight into the stability and reliability of the results.
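
Under these definitions, a per-model report can be reproduced with scikit-learn plus the normal-approximation interval. This sketch assumes `y_true` (labels), `y_pred` (predicted classes), and `y_score` (per-class probabilities) are available:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate(y_true, y_pred, y_score, z=1.96):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    auc = roc_auc_score(y_true, y_score, multi_class="ovr")  # one-vs-rest for 4 classes
    half = z * np.sqrt(acc * (1 - acc) / len(y_true))        # 95% CI half-width
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "auc": auc, "acc_95ci": (acc - half, acc + half)}
```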

Computational complexity analysis

To gain an intuitive understanding of the characteristics of LGSF-Net, we compute floating point operations (FLOPs), parameters, and accuracy for LGSF-Net, ResNet5020, InceptionV319, ViT21, Swin Transformer22, and Vision Mamba23. As shown in Fig. 4, LGSF-Net performs best on all indicators. It is noteworthy that LGSF-Net’s parameter scale is much smaller than that of any other model, with 18,680 total parameters. Among the five baseline models, Vision Mamba and InceptionV3 are known for their efficient architectures, both keeping their parameter scales around 25 million. The proposed model reduces its parameter count to roughly 18.7K, only about 0.075% of the scale of Vision Mamba and InceptionV3, while still achieving the highest accuracy (95.97%). This accuracy is 2.97% higher than that of ResNet5020, which has the second-best accuracy among the five baselines. In terms of FLOPs, the best baseline model is again InceptionV3, which requires only 5.69 GFLOPs; LGSF-Net improves further on that, reducing its floating point operations to 0.93 GFLOPs. Therefore, by combining the advantages of CNN and transformer models, the proposed model maintains high performance while keeping the model small and classification efficient, making it more suitable for the target eye disease classification task than other popular models. The simple architecture with small GFLOPs and parameter counts also underpins LGSF-Net’s practical application value, offering hospitals an efficient and economical solution for diagnosing eye diseases.
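
The parameter counts reported in Fig. 4 can be verified directly from the model, while FLOPs require a profiler; the sketch below uses fvcore's FlopCountAnalysis as one common choice (the paper does not state which tool was used, and the input resolution is an assumption):

```python
import torch
from fvcore.nn import FlopCountAnalysis  # assumption: fvcore is installed

model = LGSFNet()                         # hypothetical: the model sketched earlier
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")        # ~18.7K reported for LGSF-Net

x = torch.randn(1, 3, 224, 224)           # assumed input resolution
print(f"GFLOPs: {FlopCountAnalysis(model, x).total() / 1e9:.2f}")
```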

Fig. 4.

Computational complexity analysis compared with popular models.

Classification metrics comparison

The classification report of the proposed LGSF-Net was calculated on the test set and compared with the other five state-of-the-art models. Figure 5 summarizes the classification metrics for each model.

Fig. 5.

Classification metrics comparison.

Among the baseline models, ResNet5020 demonstrated the best performance, achieving an accuracy of 94% and macro-average precision, recall, and F1-Score of 0.94. ResNet50’s deep learning framework is highly effective at detecting detailed features in fundus images. Its residual connections mitigate the vanishing gradient problem, enabling the network to learn intricate capillary details, such as position, thickness, and distribution, across multiple layers46. Although ResNet50 requires longer training times due to its complexity20, this model’s ability to focus on local features makes it highly effective for classifying eye diseases. This precision in capturing fine-grained details explains its strong performance across all disease categories.

The ViT21 also achieved competitive results, with an accuracy of 90% and a macro-average F1-Score of 0.90. The ViT leverages its attention mechanism to model both contextual and local relationships within the input features. This mechanism works particularly well with fundus images by integrating interrelated information, such as variations in capillary characteristics, and analyzing the relationships between local details. This ability to capture and analyze feature interdependencies explains its robust performance, although it falls slightly short of ResNet5020 in terms of accuracy and F1-Score.

InceptionV319 exhibited slightly lower performance, achieving an accuracy of 89%. Its relatively simple architecture, characterized by fewer parameters and reduced computational complexity, may limit its ability to process and retain the intricate and large-scale information present in fundus images. However, InceptionV3’s lightweight design and efficient operations make it suitable for resource-constrained clinical applications. Despite its lower performance compared to ResNet5020 and ViT21, InceptionV3 remains a viable option in scenarios with limited computational budgets, offering an effective balance between performance and efficiency.

In contrast, Swin Transformer22 showed the lowest performance among the baseline models, achieving an accuracy of 81%. This suboptimal performance may be attributed to its small window partitioning mechanism, which segments input images into smaller patches and processes them locally. While this technique reduces computational overhead, it risks losing valuable relational information between patches, particularly in complex datasets like fundus images22,47. Consequently, the Swin Transformer struggled to achieve the same level of accuracy as the other models, highlighting the importance of maintaining global contextual information in medical image classification tasks.

The proposed LGSF-Net outperformed all baseline models, achieving the highest accuracy of 96% on the test set, with macro-average precision, recall, and F1-Score values of 0.96. Notably, the model achieved a recall of 0.99 and an F1-Score of 0.99 for the diabetic retinopathy class, demonstrating its superior ability to handle this challenging disease. Furthermore, the model achieved precision and recall values above 0.92 for all classes, highlighting its robustness across different categories of eye diseases.

Confusion matrix

The confusion matrix provides a detailed breakdown of classification performance for the proposed LGSF-Net and five baseline models. The TP, FP, TN, FN indicators for each condition are summarized in Table 3. A detailed comparison of the confusion matrix is shown in Fig. 6.

Table 3.

TP, FP, TN, FN indicators comparison for all models.

Model Cataract Diabetic Retinopathy Glaucoma
TP FP TN FN TP FP TN FN TP FP TN FN
InceptionV3 98 6 93 6 99 6 89 11 86 3 92 15
ResNet50 101 3 96 3 104 2 93 6 90 2 91 11
ViT 99 4 95 5 100 4 91 10 87 4 90 14
Swin Transformer 95 8 91 9 93 7 88 12 80 6 85 15
Vision Mamba 96 7 92 8 96 5 90 11 85 5 88 13
LGSF-Net (Proposed) 100 4 95 2 99 1 95 6 86 2 92 1

Fig. 6.

Confusion matrix comparison. For clear visualization, class codes are used to represent eye diseases. Class 0, 1, 2, 3 represent cataract, diabetic retinopathy, glaucoma and normal respectively.

LGSF-Net demonstrates superior performance in all three disease categories compared to the baseline models. For cataract classification, LGSF-Net achieved the highest number of true positives (100) and the lowest number of false negatives (2), outperforming all other models. While ResNet5020 and ViT21 also demonstrated strong performance for cataract, both had higher false negative rates, with ResNet5020 and ViT21 misclassifying 3 and 5 samples, respectively.

In the diabetic retinopathy category, LGSF-Net achieved the best overall performance with 99 true positives and only 1 false positive, reflecting a precision of 0.99. ResNet5020, while competitive, achieved more true positives (104) but also more false positives (2). Other models, including Swin Transformer22 and Vision Mamba23, demonstrated lower recall rates due to higher false negative counts.

For glaucoma classification, which has the highest average false negative count, LGSF-Net also outperformed the other models, achieving 86 true positives with the lowest false negative count (1), compared with ViT21 (87 true positives, 14 false negatives) and Swin Transformer (80 true positives, 15 false negatives). The model’s high true positive rate and minimal false negatives underscore its robustness in identifying glaucoma cases.

The confusion matrix results indicate that Swin Transformer22 and ViT21 fail to achieve the expected high performance, primarily due to their inability to effectively capture overall image features. These models exhibit the highest average false positive and false negative rates among all tested architectures. While their attention mechanisms provide notable improvements in understanding local information, they struggle to simultaneously interpret medical images at a global scale, a critical requirement for accurately classifying eye diseases. This limitation underscores the importance of integrating both local feature extraction and global contextual understanding within a single model. The superior performance of the proposed LGSF-Net, which fuses these two capabilities, further validates its effectiveness and highlights its potential as a robust solution for medical image classification tasks.

Across all categories, LGSF-Net consistently demonstrated better balance between sensitivity and specificity, as indicated by its minimal false positives and false negatives. Furthermore, the model maintained high performance even in challenging cases such as normal classification, where other models struggled with false negatives. This comparison highlights the effectiveness of LGSF-Net in addressing the limitations of baseline architectures and achieving state-of-the-art performance in medical image classification tasks.

ROC and AUC

The Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values were analyzed to evaluate the classification performance of the proposed Local-Global Scale Fusion Network (LGSF-Net) and several baseline models, including InceptionV319, ResNet5020, ViT21, and Swin Transformer22. Table 4 summarizes the AUC values for each class across all models. In addition, the ROC curves are shown in Fig. 7.

Table 4.

AUC values and 95% confidence intervals (CI) for each class across different models.

Model Cataract Diabetic Retinopathy Glaucoma Normal Average AUC
InceptionV3 1.00 [1.00, 1.00] 1.00 [1.00, 1.00] 1.00 [0.99, 1.00] 0.99 [0.99, 1.00] 1.00
ResNet50 1.00 [0.99, 1.00] 1.00 [1.00, 1.00] 0.99 [0.98, 0.99] 0.98 [0.97, 0.99] 0.99
Swin Transformer 0.99 [0.99, 1.00] 0.96 [0.93, 0.97] 0.97 [0.95, 0.98] 0.94 [0.91, 0.96] 0.97
ViT 1.00 [0.99, 1.00] 0.98 [0.96, 0.99] 0.99 [0.98, 1.00] 0.97 [0.96, 0.98] 0.99
LGSF-Net (Proposed) 1.00 [1.00, 1.00] 1.00 [1.00, 1.00] 1.00 [0.99, 1.00] 0.99 [0.99, 1.00] 1.00

Fig. 7.

ROC comparison.

The proposed LGSF-Net achieved the highest overall performance, with an average AUC of 1.00, matching the performance of InceptionV319 but demonstrating more robust results across challenging classes. For Class 0 (Cataract), both LGSF-Net and InceptionV3 achieved perfect AUC values of 1.00 with 95% confidence intervals of [1.00, 1.00], demonstrating flawless discrimination between positive and negative samples. Similarly, Class 1 (Diabetic Retinopathy) showed an AUC of 1.00 for LGSF-Net, further highlighting its reliability in distinguishing this condition.

ResNet5020 and ViT21 also displayed strong performance, achieving average AUC values of 0.99. ResNet5020 demonstrated its effectiveness on Class 2 (Glaucoma), achieving an AUC of 0.99, although it fell slightly behind LGSF-Net on Class 3 (Normal), with an AUC of 0.98. ViT achieved similar results but exhibited slightly lower confidence intervals for certain classes. This nuance suggests that ResNet’s deep convolutional features capture local characteristics more effectively than ViT’s attention mechanism.

The Swin Transformer22, while performing well, lagged behind the other models with an average AUC of 0.97. Its performance was particularly limited for Class 3 (Normal), where it achieved an AUC of 0.94 with a 95% confidence interval of [0.91, 0.96]. As mentioned above, this may be caused by its limited ability to incorporate global information into classification.

When analyzing the ROC curves, LGSF-Net consistently exhibited steep, left-leaning curves across all classes, reflecting its superior ability to balance true positive and false positive rates. This characteristic is particularly important for medical applications, where minimizing false negatives is critical to ensure accurate diagnoses. The consistency of LGSF-Net’s AUC values across all classes demonstrates its robustness and generalization capabilities, making it highly suitable for real-world medical imaging tasks.

The ROC and AUC analyses confirm that LGSF-Net outperforms baseline models by delivering superior classification performance with high precision and recall across all disease categories. These results reinforce the effectiveness of the proposed local-global multi-scale fusion strategy in addressing challenges specific to medical image classification.

Ablation study

Visualization of feature learning effects

To evaluate the significance of the local sensing block (CNN block) and global sensing block (transformer block) within LGSF-Net, an ablation study was conducted by systematically removing each component and observing the corresponding impact on model performance. The goal of this experiment was to validate the effectiveness of the proposed local-global feature fusion approach.

The results of the ablation experiment are illustrated in Figs. 8, 9 and 10, which present the feature heat maps generated by the model in each configuration. These visualizations provide insight into the regions of the input images that the model deemed most significant for classification under different settings. More importantly, by comparing the different feature heat maps, we can verify whether the proposed local-global fusion mechanism is what drives the high performance of LGSF-Net.
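
The paper does not state how the heat maps were produced; one common way to obtain comparable visualizations is to hook an intermediate layer and average its channel activations, as in this sketch (layer choice and normalization are assumptions):

```python
import torch

def feature_heatmap(model, layer, image):
    """Capture one layer's activation and reduce it to a 2-D heat map."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o.detach()))
    model.eval()
    with torch.no_grad():
        model(image.unsqueeze(0))         # forward pass records the activation
    handle.remove()
    fmap = feats["out"][0]                # (C, H, W) feature tensor
    heat = fmap.abs().mean(dim=0)         # channel-wise mean as a saliency proxy
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]

# e.g. heat = feature_heatmap(model, model.conv1, fundus_tensor)  # names assumed
```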

Fig. 8.

Feature heat map for LGSF-Net after ablating global sensing block.

Fig. 9.

Feature heat map for LGSF-Net after ablating local sensing block.

Fig. 10.

Feature heat map for complete LGSF-Net.

When the global sensing block was removed, the model’s ability to capture global contextual information was significantly impaired, as shown in Fig. 8. Without the transformer block, which provides global-context understanding during classification, the model effectively reduces to a ResNet-style model20 with plain CNN layers and residual addition, which lacks the ability to precisely sense global relationships, consistent with the performance comparison above. The heat maps exhibit an over-reliance on localized features, resulting in a notable decline in performance. For example, in the heat maps of features 1, 10, and 13, most attention is concentrated on key local regions such as the optic disc and the central retinal vein, while in the other feature heat maps, most attention focuses on the edge of the retina. The attention is so concentrated on single regions that it fails to cover the global characteristics of the images. This highlights the critical role of global feature extraction in achieving robust classification, particularly for diseases with subtle global patterns, such as glaucoma.

Conversely, the removal of the local sensing block (Fig. 9) diminished the model’s capacity to focus on fine-grained details, such as capillary thickness and distribution. Without the CNN block and residual addition mechanism, the proposed model becomes a typical ViT21, which emphasizes learning from context. As shown in the heat maps of features 2, 6, 9, 10, 13, 14, and 15, a larger area around the optic disc, sometimes including the macula, is assigned relatively high importance. This enables the model to classify eye diseases according to the relationships between key elements of the retina. Moreover, compared to the heat maps without the global sensing block, these heat maps distribute attention more evenly across most of the image, as shown for features 2, 3, and 5. The resulting heat maps demonstrate a bias toward coarse, global patterns, which adversely affected the classification of diseases like diabetic retinopathy. The performance metrics for this configuration were significantly reduced, matching the results of ViT, with an accuracy of 90%, an F1-score of 0.90, and an AUC of 0.99.

Finally, the feature heat map for the complete model, shown in Fig. 10, highlights its ability to effectively capture both localized features, such as capillary structures, and global contextual information, such as fundus shape and overall texture. Specifically, the complete model not only highlights the key region of the optic disc, an essential lesion indicator for many retinal diseases, but also succeeds in distributing attention to most global information. This comprehensive feature representation contributed to the highest classification performance, with an accuracy of 96%, an F1-score of 0.96, and an AUC of 1.00 across all classes. It is therefore evident that the local-global scale fusion mechanism is the determinant of the proposed model’s high performance, which neither ablated variant can accomplish.

Improvement of the learning effect of attention

Furthermore, we demonstrate the effectiveness of LGSF-Net for multi-scale fusion learning through the weights of the attention matrix; that is, the local perception ability of convolutional blocks can also improve the global learning ability of attention.

Figure 11 shows that LGSF-Net’s attention matrix weights are more focused and effective; even with single-stream concatenation, attention learning still does not match the effect of full-model learning.

Fig. 11.

The effect of attention learning. (1) The attention matrix obtained through training using only one Transformer block. (2) The attention matrix obtained through single-stream tandem training using convolutional blocks and Transformer blocks. (3), (4) The attention matrices of the two Transformer blocks of the entire LGSF-Net respectively.

The setting of transformer hyperparameters

The transformer hyperparameter settings in Table 1 are the lightweight and effective result of our tuning. When the number of channels is evenly divisible by the number of heads, increasing the number of attention heads can capture context awareness more completely. However, improved perception does not necessarily translate into significantly better generalization, and it brings a risk of overfitting. Table 5 shows the test set accuracy under different numbers of attention heads.

Table 5.

Test-set accuracy under different numbers of attention heads (other settings remain as shown in Table 1).

Number of Heads Accuracy (%)
1 95.62
2 (default) 95.97
4 96.12
8 95.85

To a certain extent, increasing the number of attention heads enhances expressiveness and generalization, achieving the highest result with 4 heads. However, this improvement is marginal and reduces the lightweight character of the model.

The number of transformer layers is also an important setting that affects the efficiency and generalization ability of the model. In our model, the two transformer blocks use the same number of layers. The ablation study on the number of layers is shown in Table 6. Due to limited computing resources, we reduced the batch size when using deeper transformers.

Table 6.

Test-set accuracy under different Transformer depths.

Number of Layers Batch Size Accuracy (%)
1 (default) 16 95.97
2 16 96.02
3 8 95.05

The transformer block is the main source of parameters in LGSF-Net. Adding one layer of depth almost doubles the number of parameters and the computational cost of the model, while the improvement in test accuracy is marginal; at three layers, accuracy even drops due to overfitting.

Cross-Validation

To further verify the generalization performance of the model, we adopted 5-fold cross-validation. In each fold, the dataset was randomly divided into 5 parts, of which 4 were used for training and 1 for testing. Five rounds of experiments were completed in turn, and the final average was taken as the comprehensive performance.

The experimental results are shown in Table 7. The model performs stably across folds, with an average accuracy of about 96% and a standard deviation of only 0.0545%, indicating good robustness and generalization ability.

Table 7.

Five-fold cross-validation result (Accuracy %).

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean ± Std
95.89 95.95 96.02 95.90 95.98 95.948 ± 0.0545

In fact, the training set of the initial experiment accounts for 80% of the dataset, which is similar to the per-fold split of 5-fold cross-validation, so the results are comparable.
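
The 5-fold protocol can be reproduced with scikit-learn's splitter; `train_and_eval` is an assumed helper that trains a fresh model and returns its test accuracy:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_accuracy(images, labels, train_and_eval, seed=0):
    # StratifiedKFold keeps the four classes balanced within every fold
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accs = [train_and_eval(images[tr], labels[tr], images[te], labels[te])
            for tr, te in skf.split(images, labels)]
    return np.mean(accs), np.std(accs)    # reported as mean ± std (Table 7)
```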

Discussion

Research Questions (RQ) and outcomes

RQ1: Can LGSF-Net provide ideal deployment performance?

Outcome: Yes. LGSF-Net attains 96% test accuracy with only 18.7K parameters and 0.93 GFLOPs, achieving state-of-the-art accuracy while reducing parameters and FLOPs, which indicates suitability for resource-constrained clinical scenarios.

RQ2: Can the Pipeline of the proposed model make the learning of global and local features more significant?

Outcome: Yes. Feature maps show that the proposed parallel local-global pipeline makes both types of features more salient. The attention-weight histograms also become sharper and more concentrated in the full model compared with the single-stream model, indicating more decisive long-range interactions after fusing local evidence.

RQ3: Are the improvements stable across data splits and disease categories?

Outcome: Yes. Five-fold cross-validation yields 95.948 ± 0.0545% accuracy, i.e., low variance; per-class precision/recall are consistently high, and the confusion matrices show low FN/FP counts across cataract, DR, and glaucoma, evidencing strong robustness and generalization.

Comprehensive evaluation

The comprehensive evaluation of each model is shown in Table 8, which summarizes the superior profile of the proposed model. In summary, LGSF-Net represents a significant step forward in medical image classification, offering robust performance and computational efficiency. While further optimizations are necessary for real-world scalability, the proposed framework lays a solid foundation for future research in intelligent healthcare.

Table 8.

Comprehensive evaluation of each model.

Model  Local Sensing  Global Sensing  High Comput. Eff.  High Acc.  High Generalization Capability
InceptionV3  ✓  –  ✓  –  –
ResNet50  ✓  –  –  ✓  ✓
ViT  –  ✓  –  ✓  ✓
Swin  –  ✓  –  –  –
LGSF-Net  ✓  ✓  ✓  ✓  ✓

Limitation

As shown in Fig. 3, the dataset used contains fundus images of different forms and resolutions. However, the label categories of the dataset are balanced, which is admittedly not challenging enough: many experimental results can reach relatively ideal values. We encourage the use of LGSF-Net on imbalanced datasets with more categories.

Although LGSF-Net is already lightweight, the attention mechanism still incurs computational overhead of quadratic complexity, which limits further expansion of the model’s feature dimensions48.

Future directions

By changing the output head, LGSF-Net can accomplish other machine vision tasks that exhibit both global and local features simultaneously, such as image restoration49. It is also possible to increase the number of network layers and attention heads together for training on large datasets such as ImageNet, so that LGSF-Net serves as a pre-trained general backbone model23.

Conclusion

In this paper, aiming at the problem of automatic diagnosis and classification of fundus medical images and taking into account the complex characteristics of fundus diseases, the LGSF-Net model is proposed based on the concept of global and local information fusion. The proposed model was tested on publicly accessible datasets, and the results show that LGSF-Net achieves higher accuracy and computational efficiency than other models, with superior performance on the other medical indicators as well, making it more suitable for automatic medical diagnosis. At the same time, the ablation study presents the feature maps learned by each feature learning module, proves the effectiveness of the global and local information fusion approach, and explains the reasons for LGSF-Net’s advantages. Future work can apply the concept of global and local information fusion to other suitable fields.

Author contributions

Ankang Lin performed all the work for the paper, including the method, experiments, visualization, writing, and final revisions.

Data availability

The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/gunavenkatdoddi/eye-diseases-classification. Moreover, the data can also be obtained by contacting the author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Li, E. Y. et al. Prevalence of blindness and outcomes of cataract surgery in hainan province in south china. Ophthalmology120, 2176–2183. 10.1016/j.ophtha.2013.04.003 (2013). [DOI] [PubMed] [Google Scholar]
  • 2.Nath, T. et al. Prevalence of steroid-induced cataract and glaucoma in chronic obstructive pulmonary disease patients attending a tertiary care center in india. Asia-Pac. J. Ophthalmol.6, 28–32. 10.22608/APO.201616 (2017). [DOI] [PubMed] [Google Scholar]
  • 3.Tsiknakis, N. et al. Deep learning for diabetic retinopathy detection and classification based on fundus images: A review. Comput. Biol. Med.135, 104599. 10.1016/j.compbiomed.2021.104599 (2021). [DOI] [PubMed] [Google Scholar]
  • 4.Tham, Y.-C. et al. Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis. Ophthalmology121, 2081–2090. 10.1016/j.ophtha.2014.05.013 (2014). [DOI] [PubMed] [Google Scholar]
  • 5.Kumari, P. & Saxena, P. Cataract detection and visualization based on multi-scale deep features by rinet tuned with cyclic learning rate hyperparameter. Biomed. Signal Process. Control87, 105452. 10.1016/j.bspc.2023.105452 (2024). [Google Scholar]
  • 6.Islam, M. M. et al. Predicting the risk of diabetic retinopathy using explainable machine learning algorithms. Diabetes Metab. Syndr.: Clin. Res. Rev.17, 102919. 10.1016/j.dsx.2023.102919 (2023). [DOI] [PubMed] [Google Scholar]
  • 7.Huang, C., Sarabi, M. & Ragab, A. E. Mobilenet-v2 /ifho model for accurate detection of early-stage diabetic retinopathy. Heliyon10, e37293. 10.1016/j.heliyon.2024.e37293 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Baudouin, C., Kolko, M., Melik-Parsadaniantz, S. & Messmer, E. M. Inflammation in glaucoma: From the back to the front of the eye, and beyond. Prog. Retin. Eye Res.83, 100916. 10.1016/j.preteyeres.2020.100916 (2021). [DOI] [PubMed] [Google Scholar]
  • 9.Balaha, H. M., Hassan, A.E.-S., Ahmed, R. A. & Balaha, M. H. Advancing eye disease detection: A comprehensive study on computer-aided diagnosis with vision transformers and shap explainability techniques. Biocybern. Biomed. Eng.45, 23–33. 10.1016/j.bbe.2024.11.005 (2025). [Google Scholar]
  • 10.Albuquerque, C., Henriques, R. & Castelli, M. Deep learning-based object detection algorithms in medical imaging: Systematic review. Heliyon11, e41137. 10.1016/j.heliyon.2024.e41137 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Kim, B., Zhuang, Y., Mathai, T. S. & Summers, R. M. Otmorph: Unsupervised multi-domain abdominal medical image registration using neural optimal transport. IEEE Trans. Med. Imaging44, 165–179. 10.1109/TMI.2024.3437295 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Shi, J. et al. A survey of label-noise deep learning for medical image analysis. Med. Image Anal.95, 103166. 10.1016/j.media.2024.103166 (2024). [DOI] [PubMed] [Google Scholar]
  • 13.Gao, Y., Zhang, J., Wei, S. & Li, Z. Pformer: An efficient cnn-transformer hybrid network with content-driven p-attention for 3d medical image segmentation. Biomed. Signal Process. Control101, 107154. 10.1016/j.bspc.2024.107154 (2025). [Google Scholar]
  • 14.Oliveira, G. C. et al. Robust deep learning for eye fundus images: Bridging real and synthetic data for enhancing generalization. Biomed. Signal Process. Control94, 106263. 10.1016/j.bspc.2024.106263 (2024). [Google Scholar]
  • 15.Khan, S. U. R. et al. Optimized deep learning model for comprehensive medical image analysis across multiple modalities. Neurocomputing619, 129182. 10.1016/j.neucom.2024.129182 (2025). [Google Scholar]
  • 16.Bhati, A., Gour, N., Khanna, P. & Ojha, A. Discriminative kernel convolution network for multi-label ophthalmic disease detection on imbalanced fundus image dataset. Comput. Biol. Med.153, 106519. 10.1016/j.compbiomed.2022.106519 (2023). [DOI] [PubMed] [Google Scholar]
  • 17.Toğaçar, M. Detection of retinopathy disease using morphological gradient and segmentation approaches in fundus images. Comput. Methods Programs Biomed.214, 106579. 10.1016/j.cmpb.2021.106579 (2022). [DOI] [PubMed] [Google Scholar]
  • 18.Al-Fahdawi, S. et al. Fundus-deepnet: Multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf. Fusion102, 102059. 10.1016/j.inffus.2023.102059 (2024). [Google Scholar]
  • 19.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826 (2016).
  • 20.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
  • 21.Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 22.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
  • 23.Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024).
  • 24.Ranjbarzadeh, R. et al. Breast tumor localization and segmentation using machine learning techniques: Overview of datasets, findings, and methods. Comput. Biol. Med.152, 106443. 10.1016/j.compbiomed.2022.106443 (2023). [DOI] [PubMed] [Google Scholar]
  • 25.Vijayarajeswari, R., Parthasarathy, P., Vivekanandan, S. & Basha, A. A. Classification of mammogram for early detection of breast cancer using svm classifier and hough transform. Measurement146, 800–805. 10.1016/j.measurement.2019.05.083 (2019). [Google Scholar]
  • 26.Latif, J. et al. Enhanced nature inspired-support vector machine for glaucoma detection. Comput. Mater. Contin.76, 1151–1172. 10.32604/cmc.2023.040152 (2023). [Google Scholar]
  • 27.K, A. et al. Effect of multi filters in glucoma detection using random forest classifier. Meas.: Sens.25, 100566, 10.1016/j.measen.2022.100566 (2023).
  • 28.Hedberg-Buenz, A. et al. Quantitative measurement of retinal ganglion cell populations via histology-based random forest classification. Exp. Eye Res.146, 370–385. 10.1016/j.exer.2015.09.011 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Abraham, B. & Nair, M. S. Computer-aided diagnosis of clinically significant prostate cancer from mri images using sparse autoencoder and random forest classifier. Biocybern. Biomed. Eng.38, 733–744. 10.1016/j.bbe.2018.06.009 (2018). [Google Scholar]
  • 30.Riza Rizky, L. M. & Suyanto, S. Adversarial training and deep k-nearest neighbors improves adversarial defense of glaucoma severity detection. Heliyon8, e12275. 10.1016/j.heliyon.2022.e12275 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Cherif, W. Optimization of k-nn algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis. Procedia Computer Science127, 293–299, 10.1016/j.procs.2018.01.125 (2018). Proceedings of the first international conference on intelligent computing in data sciences, ICDS2017.
  • 32.Jia, W. et al. Ankylosing spondylitis prediction using fuzzy k-nearest neighbor classifier assisted by modified jaya optimizer. Comput. Biol. Med.175, 108440. 10.1016/j.compbiomed.2024.108440 (2024). [DOI] [PubMed] [Google Scholar]
  • 33.Kommaraju, R. & Anbarasi, M. Diabetic retinopathy detection using convolutional neural network with residual blocks. Biomed. Signal Process. Control87, 105494. 10.1016/j.bspc.2023.105494 (2024). [Google Scholar]
  • 34.Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114 (PMLR, 2019).
  • 35.Ravi, V., Acharya, V. & Alazab, M. A multichannel efficientnet deep learning-based stacking ensemble approach for lung disease detection using chest x-ray images. Clust. Comput.26, 1181–1203 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Saeedi, S., Rezayi, S., Keshavarz, H. & R. Niakan Kalhori, S. Mri-based brain tumor detection using convolutional deep learning methods and chosen machine learning techniques. BMC Med. Inform. Decis. Mak.23, 16 (2023). [DOI] [PMC free article] [PubMed]
  • 37.Chaitanya, K., Erdil, E., Karani, N. & Konukoglu, E. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. Neural Inf. Process. Syst.33, 12546–12558 (2020). [Google Scholar]
  • 38.Heidari, M. et al. Computation-efficient era: A comprehensive survey of state space models in medical image analysis. arXiv preprint arXiv:2406.03430 (2024).
  • 39.Cui, Y. & Knoll, A. Enhancing local–global representation learning for image restoration. IEEE Trans. Ind. Inform.20, 6522–6530. 10.1109/TII.2023.3345464 (2024). [Google Scholar]
  • 40.Wang, T. et al. Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. Int. J. Comput. Vis.132, 4541–4563. 10.1007/s11263-024-02056-0 (2024). [Google Scholar]
  • 41.Song, J. et al. Global and local feature reconstruction for medical image segmentation. IEEE Trans. Med. Imaging41, 2273–2284. 10.1109/TMI.2022.3162111 (2022). [DOI] [PubMed] [Google Scholar]
  • 42.Dong, A., Liu, J., Lv, G. & Cheng, J. Glmr-net: Global-to-local mutually reinforcing network for pneumonia segmentation and classification. Pattern Recognit.162, 111371. 10.1016/j.patcog.2025.111371 (2025). [Google Scholar]
  • 43.Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
  • 44.Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. (2017).
  • 45.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2017).
  • 46.Xu, W., Fu, Y.-L. & Zhu, D. Resnet and its application to medical image processing: Research progress and challenges. Comput. Methods Programs Biomed.240, 107660. 10.1016/j.cmpb.2023.107660 (2023). [DOI] [PubMed] [Google Scholar]
  • 47.Pan, C., Chen, J. & Huang, R. Medical image detection and classification of renal incidentalomas based on yolov4+asff swin transformer. J. Radiat. Res. Appl. Sci.17, 100845. 10.1016/j.jrras.2024.100845 (2024). [Google Scholar]
  • 48.Dai, K. et al. Dsap: Dynamic sparse attention perception matcher for accurate local feature matching. IEEE Trans. Instrum. Meas.73, 1–16. 10.1109/TIM.2024.3370781 (2024). [Google Scholar]
  • 49.Wu, G., Jiang, J., Jiang, K., Liu, X. & Nie, L. Learning dynamic prompts for all-in-one image restoration. IEEE Trans. Image Process.34, 3997–4010. 10.1109/TIP.2025.3567205 (2025). [DOI] [PubMed] [Google Scholar]
