Scientific Reports. 2025 Oct 16;15:36223. doi: 10.1038/s41598-025-20226-z

Efficient fusion transformer model for accurate classification of eye diseases

Ankang Lin
PMCID: PMC12533095  PMID: 41102308

Abstract

Deep learning based automatic diagnosis models for medical images can improve diagnostic efficiency and reduce diagnostic costs. At present, there is a lack of research on dedicated artificial intelligence models for analyzing the characteristics of fundus diseases in medical images. Considering that fundus diseases have both local and global features, this paper proposes a novel deep learning model, the Local-Global Scale Fusion Network (LGSF-Net). The novelty lies in a dual-stream fusion design that processes global context (Transformer) and local details (CNN) in parallel with residual fusion. On the public fundus dataset, LGSF-Net delivers 96% accuracy with only 18.7K parameters and 0.93 GFLOPs, outperforming existing state-of-the-art universal methods like ResNet50 and ViT. LGSF-Net is more suitable for clinical diagnosis because of its accuracy and lightweight design. The ablation study shows that LGSF-Net's concept of multi-scale fusion understanding is correctly realized. This work effectively promotes the development of smart medicine and provides a new solution for the design of new deep learning models.

Keywords: Local-Global Scale Fusion Network, Convolutional neural networks, Transformers, Fundus images, Medical image analysis.

Subject terms: Computer science, Software

Introduction

Millions of people worldwide suffer from various eye conditions, which affect different components of the visual system. Some eye conditions can cause severe or even irreversible vision impairment. In particular, the rising prevalence of cataracts1,2, diabetic retinopathy3, and glaucoma4 significantly contributes to global vision impairment, forming the focus of this research. Cataract is a visual impairment caused by clouding of the eye’s crystalline lens. Early identification is crucial for cataract treatment5, and accurate classification in the early stages is invaluable for improving treatment outcomes. Another common eye disease that causes vision loss is diabetic retinopathy (DR). DR is a retinal microvascular disease that can cause severe vision impairment or blindness in the diabetic population6. Symptoms of DR, such as blurred vision, difficulties with color perception, and eyeball floaters, can be so subtle that a comprehensive ocular examination is required to detect them7. Moreover, glaucoma is a blinding ophthalmic disease characterized by optic nerve damage and defects in retinal ganglion cells, often associated with elevated intraocular pressure. Factors contributing to glaucoma include age-related oxidative stress and initial optic nerve damage8. These challenges underscore the difficulty of its classification in clinical practice.

Recent studies have highlighted imaging modalities as effective tools in medical imaging, especially for diagnosing eye conditions and diseases. For instance, Optical Coherence Tomography (OCT) provides insights into the retinal inner structure, facilitating the diagnosis of eye diseases such as cataracts and glaucoma. While OCT is prevalent in North America, hospitals worldwide, including those in the Asia-Pacific region, are increasingly adopting this technology to aid in diagnosing and treating eye diseases9. Other imaging techniques, including Magnetic Resonance Imaging (MRI), X-ray, and digital mammography, also contribute significantly to real-world diagnostic practices10. However, despite the improvements brought by traditional imaging technologies, the high costs of training qualified ophthalmologists and radiologists result in a lack of these resources, preventing many, especially in underdeveloped areas, from accessing accurate diagnoses at early stages. To address this issue, deep learning (DL) has been extensively deployed to assist in diagnosis by analyzing medical examination results11, such as fundus images, using state-of-the-art models like Deep Neural Networks (DNNs)12, Convolutional Neural Networks (CNNs), Vision Transformers13, and Generative Adversarial Networks (GANs)14. Deep learning not only helps mitigate resource shortages by enabling doctors to save interpretation time15, but also allows patients with limited access to medical resources to receive more accurate diagnoses.

However, some challenges remain for the application of deep learning in disease classification. Firstly, the lack of generalization, low classification accuracy, and computational inefficiency hinder deep learning from being practical enough for widespread hospital use. While some novel models achieve over 95% Area Under the Receiver Operating Characteristic curve (AUC)16–18, few demonstrate comparable performance on new medical imaging modalities15. Moreover, data scarcity and the complexity of medical datasets pose significant challenges to further improving deep learning models for medical classification. Current models are typically trained on datasets much smaller than the practical scale, resulting in suboptimal performance. Furthermore, professional medical images contain complex nuances that significantly affect diagnosis, and this complexity detracts from the classification accuracy of deep learning models. Another major challenge is that recent research focuses either on detection based on local details or on global features, with few models attempting to integrate both types of information to enhance their performance.

Each specific medical image has its own prior features. As shown in Fig. 1, fundus lesions have both global and local features. Therefore, a dedicated deep learning model still needs to be designed for this prior. To address the fact that existing models ignore multi-scale information fusion, this paper proposes an innovative deep learning model, the Local-Global Scale Fusion Network (LGSF-Net), for classifying eye diseases. This model fuses local features with global information to identify eye diseases more accurately under limited computing resources.

Fig. 1.

Example of a fundus image of cataract. Cataract is a common eye disease associated with aging, featuring global characteristics such as retinal darkening and halos, as well as local features such as partial lens opacity (the red box area) and vascular changes (the green box area).

The classification task for eye disease can be mathematically expressed as follows: given an input image $x$, the model computes the probability of each class $c$ using the softmax function:

$$P(y = c \mid x) = \frac{\exp(z_c)}{\sum_{c'=1}^{C} \exp(z_{c'})} \qquad (1)$$

where $z_c$ is the output logit of class $c$ from the model, and $C$ is the total number of classes (e.g., cataract, diabetic retinopathy, glaucoma, and normal).

The predicted class $\hat{y}$ is determined by:

$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} P(y = c \mid x) \qquad (2)$$

which corresponds to the class with the highest probability.

Finally, the classification result can be represented as a one-hot vector $\mathbf{y} \in \{0, 1\}^C$, where:

$$y_c = \begin{cases} 1, & c = \hat{y} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

This one-hot vector indicates the final classification output.
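
For concreteness, Eqs. (1)-(3) map directly onto a few tensor operations. The following is a minimal PyTorch sketch using random logits as a stand-in for a real classifier's output:

```python
import torch
import torch.nn.functional as F

C = 4  # total number of classes: cataract, diabetic retinopathy, glaucoma, normal

logits = torch.randn(8, C)                 # z_c for a batch of 8 images (stand-in values)
probs = F.softmax(logits, dim=1)           # Eq. (1): P(y = c | x)
y_hat = probs.argmax(dim=1)                # Eq. (2): predicted class with highest probability
one_hot = F.one_hot(y_hat, num_classes=C)  # Eq. (3): one-hot classification output
```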

With the development of deep learning, many models with excellent feature representation ability have been proposed. Comparisons with other state-of-the-art models are conducted to verify the accuracy and performance of LGSF-Net. In the experiments, InceptionV319, ResNet5020, Vision Transformer (ViT)21, Swin Transformer (Swin)22, and Vision Mamba (Vim)23 are trained under the same conditions as LGSF-Net. By comparing and analyzing metrics such as AUC, F-score, and the confusion matrix, the effectiveness of LGSF-Net is thoroughly evaluated. The main contributions of this paper are summarized as follows:

  • A deep learning model LGSF-Net is constructed to integrate global and local multi-scale understanding for eye disease classification.

  • Through a series of comparative experiments, the model’s effective generalization performance and accuracy have been verified, advancing the development of intelligent healthcare.

  • A novel solution is provided for integrating different conceptual frameworks in deep learning models, offering new directions for future research.

In the following sections, we first provide a brief review of related work, then present the model architecture in the methodology section, comparing its conceptual differences with mainstream generic models. The experiments section gives the implementation details of the model and the main results, and the ablation study verifies the effectiveness of the proposed concept. The paper concludes after discussing the extensibility of the method and the limitations of this work.

Related work

Traditional machine learning methods for medical image classification

In the traditional machine learning field, many supervised learning methods have been proposed for classifying medical images. Since 2015, Support Vector Machines (SVM), Random Forests (RF), and K-nearest Neighbors (KNN) have prevailed in the field of automatic medical detection24. These classification approaches predict the probability of diseases from features extracted from medical images through training on large amounts of data and pattern recognition. Owing to its computational speed, several researchers have applied SVM to recognizing targeted medical images. Vijayarajeswari et al.25 devised an SVM classifier with a two-dimensional transform for detecting breast cancer. Latif et al.26 developed a new glaucoma classification approach, EGWO-SVM, which processes extracted features more effectively. However, SVM’s main drawback lies in its inefficiency when training on large datasets, preventing further application given the volume of data required in clinical practice24. In response to this problem, other studies adopt the RF classifier, which is more effective at handling large amounts of data with nonlinear patterns in medical detection27–29. Although random forests perform well on huge datasets, the architecture is too complicated in some conditions for RF to achieve a satisfying processing rate. Another popular method is KNN, which relies on the similarity between the input and the training data. It is widely used to improve diagnostic reliability by recognizing major diseases such as glaucoma30, breast tumors31, and ankylosing spondylitis32.

Deep learning methods for medical image classification

CNN-based models have significantly advanced medical image classification by enabling the automatic learning of spatial hierarchies of features. ResNet20, a widely adopted architecture, introduced residual learning to address the vanishing gradient issue, allowing the creation of deeper and more effective networks. This approach has proven beneficial in various applications, such as analyzing retinal fundus images to detect diabetic retinopathy20,33. Similarly, EfficientNet, an architecture that optimally scales network dimensions through compound scaling, has demonstrated impressive accuracy in tasks like lung disease detection using chest X-rays34,35. The combination of these architectures with preprocessing techniques, such as image augmentation and transfer learning, has further enhanced their performance in applications such as brain tumor classification using MRI scans36. Despite these advancements, CNNs face challenges related to interpretability and computational efficiency, particularly when dealing with high-resolution medical images. This conclusion inspires us to fuse CNNs with other up-to-date model frameworks so that their advantages can be maintained while their disadvantages are alleviated.

One significant limitation in the current literature is the absence of models specifically designed for medical imaging tasks, which demand much higher accuracy across large-scale classification. While general-purpose architectures such as ResNet20 and EfficientNet have shown effectiveness, they often fail to address the unique characteristics of medical datasets. These datasets frequently exhibit small sample sizes and high intra-class variability, which require domain-specific modifications. Incorporating prior anatomical knowledge or task-specific constraints into the models could significantly improve their performance15. Therefore, a new model that targets the characteristics of medical images, such as information variety, should be considered in order to improve the accuracy of diagnosis.

Another under-explored area is the application of contrastive learning and state-space modeling techniques in medical imaging. Contrastive learning, which has shown promise in unsupervised learning for computer vision, could be highly beneficial in leveraging unannotated medical data to learn meaningful representations. This approach has the potential to improve classification performance while reducing the reliance on annotated data37. Similarly, state-space modeling, commonly used in time-series analysis, could provide new opportunities for analyzing structural and sequential dependencies in medical images. These methods could be particularly useful for volumetric imaging datasets or longitudinal studies, enabling better modeling of disease progression patterns38.

Local-global deep learning models

In recent years, multi-scale fusion understanding has been one of the important design concepts for improving deep learning models, aiming to consider both the global and local features of visual semantics simultaneously. Local-global deep learning models are widely used in general image restoration39 and removal of specific image noise40. This design concept is relatively simple to implement and can significantly improve the performance of the model, providing new improvement ideas for many computer vision tasks.

The medical images of many diseases also have the characteristics of multiple scales simultaneously. Therefore, some models for medical image analysis based on the concept of multi-scale fusion understanding have emerged41,42. These works do not involve the analysis of eye diseases. The prior knowledge about eye diseases also indicates that fundus images will show lesion characteristics of different scales. The overall changes of the lens and retina need to be combined with local changes such as vascular margins to accurately diagnose the disease.

Methodology

Model architecture overview

We propose LGSF-Net (Local-Global Scale Fusion Network), a novel dual-stream architecture that effectively combines convolutional neural networks (CNNs) and transformers for retinal image analysis. As illustrated in Fig. 2, our model employs a parallel processing strategy that leverages both the local feature extraction capabilities of CNNs and the global context modeling of transformers. By using CNN and transformer in parallel, the model is expected to capture local details, such as features of capillaries, and global information, such as the color and shape of the fundus, which is highly suitable for retinal disease diagnosis. Besides the key CNN and transformer modules, other established techniques, such as preprocessing, residual addition, and average pooling, are also adopted in the architecture.

Fig. 2.

This figure illustrates the comprehensive architecture of our proposed LGSF-Net model. (a) Data preprocessing pipeline: The fundus image dataset undergoes initial preprocessing, followed by stratified partitioning into training, validation, and test sets. The training set is duplicated to enable parallel processing through the dual-stream network. (b) Overall model architecture: The dual-stream design processes the cloned inputs ($X_1$, $X_2$) through complementary pathways. The first stream processes data sequentially through a ConvBlock followed by a transformer module, while the second stream reverses this order. The outputs from both streams are combined via element-wise addition to produce the final prediction. (c) ConvBlock structure: Each ConvBlock consists of three cascaded $3 \times 3$ convolutional layers with ReLU activation functions, maintaining consistent feature dimensionality ($2d$) throughout. (d) Transformer module architecture: The module implements a multi-head attention mechanism. The input features undergo parallel processing through Query (Q), Key (K), and Value (V) linear transformations, followed by scaled dot-product attention. The attention outputs are concatenated and processed through a feed-forward network with dropout regularization.

Combining the local feature learning ability of convolutional blocks with the global feature learning ability of attention, a PyTorch43-style implementation of LGSF-Net is shown in Algorithm 1. The core comprises convolutional blocks (detailed in the ConvBlock module section) and transformer blocks (detailed in the Transformer module section).

Algorithm 1.

PyTorch-style pseudocode for the LGSF-Net model.
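
Since Algorithm 1 appears as a figure, the following self-contained PyTorch sketch gives one concrete reading of the dual-stream design in Fig. 2; the stem downsampling, the exact residual placement, and the pooled linear head are assumptions, not the verbatim original listing:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Three cascaded 3x3 convolutions with ReLU (Fig. 2c)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.body(x)

class TransBlock(nn.Module):
    """Multi-head self-attention over flattened spatial tokens (Fig. 2d, Table 1)."""
    def __init__(self, ch, heads=2, ff_dim=64, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(ch, ff_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(ff_dim, ch), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        t, _ = self.attn(t, t, t)
        t = self.ff(t)
        return t.transpose(1, 2).reshape(b, c, h, w)

class LGSFNet(nn.Module):
    def __init__(self, ch=16, num_classes=4):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 3, stride=4, padding=1)     # assumed embedding stem
        self.conv1, self.trans1 = ConvBlock(ch), TransBlock(ch)  # stream 1: local -> global
        self.trans2, self.conv2 = TransBlock(ch), ConvBlock(ch)  # stream 2: global -> local
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x):
        x = self.stem(x)
        s1 = self.trans1(self.conv1(x) + x)     # assumed residual addition within the stream
        s2 = self.conv2(self.trans2(x) + x)
        fused = s1 + s2                          # element-wise fusion of the two streams
        return self.head(fused.mean(dim=(2, 3)))  # average pooling + classifier

logits = LGSFNet()(torch.randn(1, 3, 64, 64))    # smoke test on a small input
```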

ConvBlock module

The ConvBlock, as shown in Fig. 2(c), consists of three cascaded $3 \times 3$ convolutional layers with ReLU activation. Each convolution operation can be expressed as

$$F(X) = \mathrm{ReLU}(W * X + b) \qquad (4)$$

where $W$ represents the convolutional kernel, $*$ denotes the convolution operation, and $b$ is the bias term. The complete ConvBlock operation can be written as:

$$\mathrm{ConvBlock}(X) = F_3(F_2(F_1(X))) \qquad (5)$$
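
Read literally, Eqs. (4) and (5) compose three convolution-plus-ReLU maps. A direct PyTorch transcription might look as follows (the channel width and the equal input/output dimensionality are assumptions):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Eq. (5): ConvBlock(X) = F3(F2(F1(X))), each F_i as in Eq. (4)."""
    def __init__(self, channels: int = 16):
        super().__init__()
        # Each layer realizes F(X) = ReLU(W * X + b), Eq. (4)
        self.f1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.f3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f3(self.relu(self.f2(self.relu(self.f1(x))))))
```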

Transformer module

The transformer module (Fig. 2(d)) implements a multi-head attention mechanism (two heads in our configuration; see Table 1). For an input feature map $X$, the attention operation44 is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (6)$$

where Q, K, and V are linear projections of the input:

$$Q = XW_Q \qquad (7)$$
$$K = XW_K \qquad (8)$$
$$V = XW_V \qquad (9)$$

The multi-head attention is computed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W_O \qquad (10)$$

where each head is computed independently using different learned projections. After that, a feed-forward block normalizes the results in preparation for the next stage. The feed-forward step is composed of two repeated combinations of linear, ReLU activation, and dropout layers. Each combination can be denoted as

$$\mathrm{FF}_i(X) = \mathrm{Dropout}(\mathrm{ReLU}(XW_i + b_i)) \qquad (11)$$

where $X$ is the output of the attention part. The whole feed-forward part can be expressed as

$$Y = \mathrm{FF}_2(\mathrm{FF}_1(X)) \qquad (12)$$

where $Y$ is the final output of the transformer block. The transformer block hyperparameters we selected are shown in Table 1.

Table 1.

Hyperparameter settings of the transformer model.

Hyperparameter Value
Tensor Channels 16
Transformer Depth 1
Attention Heads 2
Feedforward Dimension 64
Dropout Ratio for Training 0.1
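
Eqs. (6)-(12) together with Table 1 specify the transformer block almost completely. A sketch of a literal PyTorch transcription follows (the token layout and the absence of layer normalization are assumptions based on the text):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Table 1 settings: 16 channels, 2 heads, feed-forward dim 64, dropout 0.1."""
    def __init__(self, dim=16, heads=2, ff_dim=64, p=0.1):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.wq = nn.Linear(dim, dim)    # Eq. (7): Q = X W_Q
        self.wk = nn.Linear(dim, dim)    # Eq. (8): K = X W_K
        self.wv = nn.Linear(dim, dim)    # Eq. (9): V = X W_V
        self.wo = nn.Linear(dim, dim)    # Eq. (10): projection after Concat(head_1..head_h)
        self.ff = nn.Sequential(         # Eqs. (11)-(12): two (Linear, ReLU, Dropout) stacks
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Dropout(p),
            nn.Linear(ff_dim, dim), nn.ReLU(), nn.Dropout(p))

    def forward(self, x):                # x: (batch, tokens, dim)
        b, n, d = x.shape
        split = lambda t: t.view(b, n, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)  # Eq. (6)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)  # concatenate heads
        return self.ff(self.wo(out))     # Eq. (12): Y = FF2(FF1(.))
```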

Concepts and state-of-the-art models for comparison

The first concept is dual-stream processing. The parallel processing streams enable simultaneous capture of local and global features. The first stream (ConvBlock → Transformer) emphasizes local feature extraction before global context modeling, while the second stream (Transformer → ConvBlock) prioritizes global dependencies before local refinement. The combination of a convolutional network and a transformer in this model yields improved performance in medical image classification.

For comparative evaluation, we benchmark LGSF-Net against state-of-the-art models including ResNet5020, ViT21, Swin Transformer22, and Vision Mamba23.

Experiments

Dataset

The dataset used in this study consists of publicly available medical images (see the data availability statement), including fundus images and optical coherence tomography (OCT) scans. The dataset is used to classify four categories of eye status: cataract, diabetic retinopathy, glaucoma, and normal. Figure 3 summarizes the class distribution of the dataset and shows the balanced representation of each category, ensuring unbiased model training and evaluation.

Fig. 3.

Detailed information of the dataset. Each category contains different resolutions, formats and aspect ratios.

Implementation details

The experiments were conducted with a controlled setup to ensure consistency and reliability of the evaluation, and the data were preprocessed with common methods. The detailed configuration is shown in Table 2. The deep learning framework PyTorch was used to implement our model. For the Adam optimizer, the hyperparameters $\beta_1$ and $\beta_2$ were fixed throughout training. For the training set, each random data augmentation is applied with a probability of 0.1. When the loss no longer decreases steadily, the learning rate is manually halved. When the loss on the validation set no longer decreases, training is terminated.

Table 2.

Experiment setup and Hyperparameter configurations.

Parameter Configuration
Optimizer Adam45
Initial Learning Rate 0.001
Batch Size 16
Number of Epochs 50
Weight Decay Regularization Inline graphic
Hardware Environment Two NVIDIA RTX 3080 GPUs (20GB each),
Intel Xeon Platinum 8352V CPU
Data Augmentation Techniques Random rotation (Inline graphic), horizontal and vertical
flipping, random cropping (90%), Gaussian noise
Image Resolution Inline graphic pixels
Dataset Split Ratio 80% training, 10% validation, 10% testing
Loss Function Cross Entropy Loss
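
Putting Table 2 together, a minimal training-loop sketch might look as follows. The model class, `train_loader`, and `val_loss` helper are assumed names; the paper halves the learning rate manually when the loss plateaus, which ReduceLROnPlateau approximates here, and the patience values are illustrative:

```python
import torch

model = LGSFNet()  # hypothetical: the model sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Table 2: Adam, LR 0.001
criterion = torch.nn.CrossEntropyLoss()                     # Table 2: cross-entropy loss
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

best, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):                                     # Table 2: 50 epochs
    model.train()
    for images, labels in train_loader:                     # batch size 16 (Table 2)
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    v = val_loss(model)                                     # assumed validation helper
    scheduler.step(v)                                       # halve LR when loss plateaus
    best, bad_epochs = (v, 0) if v < best else (best, bad_epochs + 1)
    if bad_epochs >= patience:                              # stop when val loss stalls
        break
```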

Metrics

To comprehensively evaluate the performance of the proposed model, the following metrics were employed:

  • Accuracy: The overall classification accuracy is calculated as:
    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Precision: Precision measures the proportion of true positive predictions among all positive predictions:
    $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
  • Recall (Sensitivity): Recall evaluates the proportion of actual positives correctly identified:
    $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
  • Specificity: Specificity evaluates the proportion of actual negatives correctly identified:
    $$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
  • F1-Score: The F1-Score represents the harmonic mean of precision and recall:
    $$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
  • Area Under the Curve (AUC): AUC quantifies the ability of the model to distinguish between classes. It is derived from the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various threshold settings.

  • Receiver Operating Characteristic (ROC): The ROC curve visualizes the trade-off between sensitivity and specificity across different thresholds, providing a graphical representation of model performance.

  • Confidence Interval (CI): To quantify the reliability of the evaluation metrics, a 95% confidence interval is computed for each metric. The confidence interval for the accuracy is calculated using the following formula:
    $$\mathrm{CI} = \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
    where $\hat{p}$ is the observed accuracy, $z$ is the critical value for a 95% confidence level ($z = 1.96$), and $n$ is the total number of samples. Similar calculations are applied to precision, recall, and F1-Score, providing statistical insight into the stability and reliability of the results.
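
Under these definitions, a per-model report can be reproduced with scikit-learn plus the normal-approximation interval. This sketch assumes `y_true` (labels), `y_pred` (predicted classes), and `y_score` (per-class probabilities) are available:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate(y_true, y_pred, y_score, z=1.96):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    auc = roc_auc_score(y_true, y_score, multi_class="ovr")  # one-vs-rest for 4 classes
    half = z * np.sqrt(acc * (1 - acc) / len(y_true))        # 95% CI half-width
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "auc": auc, "acc_95ci": (acc - half, acc + half)}
```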

Computational complexity analysis

To gain an intuitive understanding of the characteristics of LGSF-Net, we compute floating point operations (FLOPs), parameters, and accuracy for LGSF-Net, ResNet5020, InceptionV319, ViT21, Swin Transformer22, and Vision Mamba23. As shown in Fig. 4, LGSF-Net performs best on all indicators. It is noteworthy that LGSF-Net’s parameter scale is much smaller than that of any other model, with 18,680 total parameters. Among the five baseline models, Vision Mamba and InceptionV3 are known for their efficient architectures, both keeping their parameter scales around 25 million. The proposed model reduces its parameter count to roughly 18.7K, only about 0.075% of the scale of Vision Mamba and InceptionV3, while still achieving the highest accuracy (95.97%). This accuracy is 2.97% higher than that of ResNet5020, which has the second-best accuracy among the five baselines. In terms of FLOPs, the best baseline model is again InceptionV3, which requires only 5.69 GFLOPs; LGSF-Net improves further on that, reducing its floating point operations to 0.93 GFLOPs. Therefore, by combining the advantages of CNN and transformer models, the proposed model maintains high performance while keeping the model small and classification efficient, making it more suitable for the target eye disease classification task than other popular models. The simple architecture with small GFLOPs and parameter counts also underpins LGSF-Net’s practical application value, offering hospitals an efficient and economical solution for diagnosing eye diseases.
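
The parameter counts reported in Fig. 4 can be verified directly from the model, while FLOPs require a profiler; the sketch below uses fvcore's FlopCountAnalysis as one common choice (the paper does not state which tool was used, and the input resolution is an assumption):

```python
import torch
from fvcore.nn import FlopCountAnalysis  # assumption: fvcore is installed

model = LGSFNet()                         # hypothetical: the model sketched earlier
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params:,}")        # ~18.7K reported for LGSF-Net

x = torch.randn(1, 3, 224, 224)           # assumed input resolution
print(f"GFLOPs: {FlopCountAnalysis(model, x).total() / 1e9:.2f}")
```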

Fig. 4.

Computational complexity analysis compared with popular models.

Classification metrics comparison

The classification report of the proposed LGSF-Net was calculated on the test set and compared with the other five state-of-the-art models. Figure 5 summarizes the classification metrics for each model.

Fig. 5.

Classification metrics comparison.

Among the baseline models, ResNet5020 demonstrated the best performance, achieving an accuracy of 94% and macro-average precision, recall, and F1-Score of 0.94. ResNet50’s deep learning framework is highly effective at detecting detailed features in fundus images. Its residual connections mitigate the vanishing gradient problem, enabling the network to learn intricate capillary details, such as position, thickness, and distribution, across multiple layers46. Although ResNet50 requires longer training times due to its complexity20, this model’s ability to focus on local features makes it highly effective for classifying eye diseases. This precision in capturing fine-grained details explains its strong performance across all disease categories.

The ViT21 also achieved competitive results, with an accuracy of 90% and a macro-average F1-Score of 0.90. The ViT leverages its attention mechanism to model both contextual and local relationships within the input features. This mechanism works particularly well with fundus images by integrating interrelated information, such as variations in capillary characteristics, and analyzing the relationships between local details. This ability to capture and analyze feature interdependencies explains its robust performance, although it falls slightly short of ResNet5020 in terms of accuracy and F1-Score.

InceptionV319 exhibited slightly lower performance, achieving an accuracy of 89%. Its relatively simple architecture, characterized by fewer parameters and reduced computational complexity, may limit its ability to process and retain the intricate and large-scale information present in fundus images. However, InceptionV3’s lightweight design and efficient operations make it suitable for resource-constrained clinical applications. Despite its lower performance compared to ResNet5020 and ViT21, InceptionV3 remains a viable option in scenarios with limited computational budgets, offering an effective balance between performance and efficiency.

In contrast, Swin Transformer22 showed the lowest performance among the baseline models, achieving an accuracy of 81%. This suboptimal performance may be attributed to its small window partitioning mechanism, which segments input images into smaller patches and processes them locally. While this technique reduces computational overhead, it risks losing valuable relational information between patches, particularly in complex datasets like fundus images22,47. Consequently, the Swin Transformer struggled to achieve the same level of accuracy as the other models, highlighting the importance of maintaining global contextual information in medical image classification tasks.

The proposed LGSF-Net outperformed all baseline models, achieving the highest accuracy of 96% on the test set, with macro-average precision, recall, and F1-Score values of 0.96. Notably, the model achieved a recall of 0.99 and an F1-Score of 0.99 for the diabetic retinopathy class, demonstrating its superior ability to handle this challenging disease. Furthermore, the model achieved precision and recall values above 0.92 for all classes, highlighting its robustness across different categories of eye diseases.

Confusion matrix

The confusion matrix provides a detailed breakdown of classification performance for the proposed LGSF-Net and five baseline models. The TP, FP, TN, FN indicators for each condition are summarized in Table 3. A detailed comparison of the confusion matrix is shown in Fig. 6.

Table 3.

TP, FP, TN, FN indicators comparison for all models.

Model Cataract Diabetic Retinopathy Glaucoma
TP FP TN FN TP FP TN FN TP FP TN FN
InceptionV3 98 6 93 6 99 6 89 11 86 3 92 15
ResNet50 101 3 96 3 104 2 93 6 90 2 91 11
ViT 99 4 95 5 100 4 91 10 87 4 90 14
Swin Transformer 95 8 91 9 93 7 88 12 80 6 85 15
Vision Mamba 96 7 92 8 96 5 90 11 85 5 88 13
LGSF-Net (Proposed) 100 4 95 2 99 1 95 6 86 2 92 1

Fig. 6.

Confusion matrix comparison. For clear visualization, class codes are used to represent eye diseases. Class 0, 1, 2, 3 represent cataract, diabetic retinopathy, glaucoma and normal respectively.

LGSF-Net demonstrates superior performance in all three disease categories compared to the baseline models. For cataract classification, LGSF-Net achieved the highest number of true positives (100) and the lowest number of false negatives (2), outperforming all other models. While ResNet5020 and ViT21 also demonstrated strong performance for cataract, both had higher false negative rates, with ResNet5020 and ViT21 misclassifying 3 and 5 samples, respectively.

In the diabetic retinopathy category, LGSF-Net achieved the best overall performance with 99 true positives and only 1 false positive, reflecting a precision of 0.99. ResNet5020, while competitive, achieved more true positives (104) but also more false positives (2). Other models, including Swin Transformer22 and Vision Mamba23, demonstrated lower recall rates due to higher false negative counts.

For glaucoma classification, which has the highest average false negative count, LGSF-Net also outperformed the other models, achieving 86 true positives with the lowest false negative count (1), compared with ViT21 (87 true positives, 14 false negatives) and Swin Transformer (80 true positives, 15 false negatives). The model’s high true positive rate and minimal false negatives underscore its robustness in identifying glaucoma cases.

The confusion matrix results indicate that Swin Transformer22 and ViT21 fail to achieve the expected high performance, primarily due to their inability to effectively capture overall image features. These models exhibit the highest average false positive and false negative rates among all tested architectures. While their attention mechanisms provide notable improvements in understanding local information, they struggle to simultaneously interpret medical images at a global scale, a critical requirement for accurately classifying eye diseases. This limitation underscores the importance of integrating both local feature extraction and global contextual understanding within a single model. The superior performance of the proposed LGSF-Net, which fuses these two capabilities, further validates its effectiveness and highlights its potential as a robust solution for medical image classification tasks.

Across all categories, LGSF-Net consistently demonstrated better balance between sensitivity and specificity, as indicated by its minimal false positives and false negatives. Furthermore, the model maintained high performance even in challenging cases such as normal classification, where other models struggled with false negatives. This comparison highlights the effectiveness of LGSF-Net in addressing the limitations of baseline architectures and achieving state-of-the-art performance in medical image classification tasks.

ROC and AUC

The Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values were analyzed to evaluate the classification performance of the proposed Local-Global Scale Fusion Network (LGSF-Net) and several baseline models, including InceptionV319, ResNet5020, ViT21, and Swin Transformer22. Table 4 summarizes the AUC values for each class across all models. In addition, the ROC curves are shown in Fig. 7.

Table 4.

AUC values and 95% confidence intervals (CI) for each class across different models.

Model Cataract Diabetic Retinopathy Glaucoma Normal Average AUC
InceptionV3 1.00 [1.00, 1.00] 1.00 [1.00, 1.00] 1.00 [0.99, 1.00] 0.99 [0.99, 1.00] 1.00
ResNet50 1.00 [0.99, 1.00] 1.00 [1.00, 1.00] 0.99 [0.98, 0.99] 0.98 [0.97, 0.99] 0.99
Swin Transformer 0.99 [0.99, 1.00] 0.96 [0.93, 0.97] 0.97 [0.95, 0.98] 0.94 [0.91, 0.96] 0.97
ViT 1.00 [0.99, 1.00] 0.98 [0.96, 0.99] 0.99 [0.98, 1.00] 0.97 [0.96, 0.98] 0.99
LGSF-Net (Proposed) 1.00 [1.00, 1.00] 1.00 [1.00, 1.00] 1.00 [0.99, 1.00] 0.99 [0.99, 1.00] 1.00

Fig. 7.

ROC comparison.

The proposed LGSF-Net achieved the highest overall performance, with an average AUC of 1.00, matching the performance of InceptionV319 but demonstrating more robust results across challenging classes. For Class 0 (Cataract), both LGSF-Net and InceptionV3 achieved perfect AUC values of 1.00 with 95% confidence intervals of [1.00, 1.00], demonstrating flawless discrimination between positive and negative samples. Similarly, Class 1 (Diabetic Retinopathy) showed an AUC of 1.00 for LGSF-Net, further highlighting its reliability in distinguishing this condition.

ResNet5020 and ViT21 also displayed strong performance, achieving average AUC values of 0.99. ResNet5020 demonstrated its effectiveness on Class 2 (Glaucoma), achieving an AUC of 0.99, although it fell slightly behind LGSF-Net on Class 3 (Normal), with an AUC of 0.98. ViT achieved similar results but exhibited slightly lower confidence intervals for certain classes. This nuance suggests that ResNet’s deep convolutional features capture local characteristics more effectively than ViT’s attention mechanism.

The Swin Transformer22, while performing well, lagged behind the other models with an average AUC of 0.97. Its performance was particularly limited for Class 3 (Normal), where it achieved an AUC of 0.94 with a 95% confidence interval of [0.91, 0.96]. As mentioned above, this may be caused by its limited ability to incorporate global information into classification.

When analyzing the ROC curves, LGSF-Net consistently exhibited steep, left-leaning curves across all classes, reflecting its superior ability to balance true positive and false positive rates. This characteristic is particularly important for medical applications, where minimizing false negatives is critical to ensure accurate diagnoses. The consistency of LGSF-Net’s AUC values across all classes demonstrates its robustness and generalization capabilities, making it highly suitable for real-world medical imaging tasks.

The ROC and AUC analyses confirm that LGSF-Net outperforms baseline models by delivering superior classification performance with high precision and recall across all disease categories. These results reinforce the effectiveness of the proposed local-global multi-scale fusion strategy in addressing challenges specific to medical image classification.

Ablation study

Visualization of feature learning effects

To evaluate the significance of the local sensing block (CNN block) and global sensing block (transformer block) within LGSF-Net, an ablation study was conducted by systematically removing each component and observing the corresponding impact on model performance. The goal of this experiment was to validate the effectiveness of the proposed local-global feature fusion approach.

The results of the ablation experiment are illustrated in Figs. 8, 9 and 10, which present the feature heat maps generated by the model in each configuration. These visualizations provide insight into the regions of the input images that the model deemed most significant for classification under different settings. More importantly, by comparing the different feature heat maps, we can verify whether the proposed local-global fusion mechanism is what drives the high performance of LGSF-Net.
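
The paper does not state how the heat maps were produced; one common way to obtain comparable visualizations is to hook an intermediate layer and average its channel activations, as in this sketch (layer choice and normalization are assumptions):

```python
import torch

def feature_heatmap(model, layer, image):
    """Capture one layer's activation and reduce it to a 2-D heat map."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o.detach()))
    model.eval()
    with torch.no_grad():
        model(image.unsqueeze(0))         # forward pass records the activation
    handle.remove()
    fmap = feats["out"][0]                # (C, H, W) feature tensor
    heat = fmap.abs().mean(dim=0)         # channel-wise mean as a saliency proxy
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # scale to [0, 1]

# e.g. heat = feature_heatmap(model, model.conv1, fundus_tensor)  # names assumed
```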

Fig. 8.

Feature heat map for LGSF-Net after ablating global sensing block.

Fig. 9.

Feature heat map for LGSF-Net after ablating local sensing block.

Fig. 10.

Feature heat map for complete LGSF-Net.

When the global sensing block was removed, the model’s ability to capture global contextual information was significantly impaired, as shown in Fig. 8. Without the transformer block, which provides global-context understanding during classification, the model effectively reduces to a ResNet-style model20 with plain CNN layers and residual addition, which lacks the ability to precisely sense global relationships, consistent with the performance comparison above. The heat maps exhibit an over-reliance on localized features, resulting in a notable decline in performance. For example, in the heat maps of features 1, 10, and 13, most attention is concentrated on key local regions such as the optic disc and the central retinal vein, while in the other feature heat maps, most attention focuses on the edge of the retina. The attention is so concentrated on single regions that it fails to cover the global characteristics of the images. This highlights the critical role of global feature extraction in achieving robust classification, particularly for diseases with subtle global patterns, such as glaucoma.

Conversely, the removal of the local sensing block (Fig. 9) diminished the model’s capacity to focus on fine-grained details, such as capillary thickness and distribution. Without the CNN block and residual addition mechanism, the proposed model becomes a typical ViT21, which emphasizes learning from context. As shown in the heat maps of features 2, 6, 9, 10, 13, 14, and 15, a larger area around the optic disc, sometimes including the macula, is assigned relatively high importance. This enables the model to classify eye diseases according to the relationships between key elements of the retina. Moreover, compared to the heat maps without the global sensing block, these heat maps distribute attention more evenly across most of the image, as shown for features 2, 3, and 5. The resulting heat maps demonstrate a bias toward coarse, global patterns, which adversely affected the classification of diseases like diabetic retinopathy. The performance metrics for this configuration were significantly reduced, matching the results of ViT, with an accuracy of 90%, an F1-score of 0.90, and an AUC of 0.99.

Finally, the feature heat map for the complete model, shown in Fig. 10, highlights its ability to effectively capture both localized features, such as capillary structures, and global contextual information, such as fundus shape and overall texture. Specifically, the complete model not only highlights the key region of the optic disc, an essential lesion indicator for many retinal diseases, but also succeeds in distributing attention to most global information. This comprehensive feature representation contributed to the highest classification performance, with an accuracy of 96%, an F1-score of 0.96, and an AUC of 1.00 across all classes. It is therefore evident that the local-global scale fusion mechanism is the determinant of the proposed model’s high performance, which neither ablated variant can accomplish.

Improvement of the learning effect of attention

Furthermore, we demonstrate the effectiveness of LGSF-Net for multi-scale fusion learning through the weights of the attention matrix; that is, the local perception ability of convolutional blocks can also improve the global learning ability of attention.

Figure 11 shows that LGSF-Net’s attention matrix weights are more focused and effective; even with single-stream concatenation, attention learning still does not match the effect of full-model learning.

Fig. 11.

The effect of attention learning. (1) The attention matrix obtained through training using only one Transformer block. (2) The attention matrix obtained through single-stream tandem training using convolutional blocks and Transformer blocks. (3), (4) The attention matrices of the two Transformer blocks of the entire LGSF-Net respectively.

The setting of transformer hyperparameters

The transformer hyperparameter settings in Table 1 are the lightweight and effective result of our tuning. When the number of channels is evenly divisible by the number of heads, increasing the number of attention heads can capture context awareness more completely. However, improved perception does not necessarily translate into significantly better generalization, and it brings a risk of overfitting. Table 5 shows the test set accuracy under different numbers of attention heads.

Table 5.

Test-set accuracy under different numbers of attention heads (other settings remain as shown in Table 1).

Number of Heads Accuracy (%)
1 95.62
2 (default) 95.97
4 96.12
8 95.85

To a certain extent, increasing the number of attention heads enhances expressiveness and generalization, achieving the highest result with 4 heads. However, this improvement is marginal and reduces the lightweight character of the model.

The number of transformer layers is also an important setting that affects the efficiency and generalization ability of the model. In our model, the two transformer blocks use the same number of layers. The ablation study on the number of layers is shown in Table 6. Due to limited computing resources, we reduced the batch size when using deeper transformers.

Table 6.

Test-set accuracy under different Transformer depths.

Number of Layers Batch Size Accuracy (%)
1 (default) 16 95.97
2 16 96.02
3 8 95.05

The transformer block is the main source of parameters in LGSF-Net. Adding one layer of depth almost doubles the number of parameters and the computational cost of the model, while the improvement in test accuracy is marginal; at three layers, accuracy even drops due to overfitting.

Cross-Validation

To further verify the generalization performance of the model, we adopted 5-fold cross-validation. In each fold, the dataset was randomly divided into 5 parts, of which 4 were used for training and 1 for testing. Five rounds of experiments were completed in turn, and the final average was taken as the comprehensive performance.

The experimental results are shown in Table 7. The model performs stably across folds, with an average accuracy of about 96% and a standard deviation of only 0.0545%, indicating good robustness and generalization ability.

Table 7.

Five-fold cross-validation result (Accuracy %).

Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean ± Std
95.89 95.95 96.02 95.90 95.98 95.948 ± 0.0545

In fact, the training set of the initial experiment accounts for 80% of the dataset, which is similar to the per-fold split of 5-fold cross-validation, so the results are comparable.
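
The 5-fold protocol can be reproduced with scikit-learn's splitter; `train_and_eval` is an assumed helper that trains a fresh model and returns its test accuracy:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_fold_accuracy(images, labels, train_and_eval, seed=0):
    # StratifiedKFold keeps the four classes balanced within every fold
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accs = [train_and_eval(images[tr], labels[tr], images[te], labels[te])
            for tr, te in skf.split(images, labels)]
    return np.mean(accs), np.std(accs)    # reported as mean ± std (Table 7)
```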

Discussion

Research Questions (RQ) and outcomes

RQ1: Can LGSF-Net provide ideal deployment performance?

Outcome: Yes. LGSF-Net attains 96% test accuracy with only 18.7K parameters and 0.93 GFLOPs, achieving state-of-the-art accuracy while reducing parameters and FLOPs, which indicates suitability for resource-constrained clinical scenarios.

RQ2: Can the Pipeline of the proposed model make the learning of global and local features more significant?

Outcome: Yes. Feature maps show that the proposed parallel local-global pipeline makes both types of features more salient. The attention-weight histograms also become sharper and more concentrated in the full model compared with the single-stream model, indicating more decisive long-range interactions after fusing local evidence.

RQ3: Are the improvements stable across data splits and disease categories?

Outcome: Yes. Five-fold cross-validation yields 95.948 ± 0.0545% accuracy, i.e., low variance; per-class precision/recall are consistently high, and the confusion matrices show low FN/FP counts across cataract, DR, and glaucoma, evidencing strong robustness and generalization.

Comprehensive evaluation

The comprehensive evaluation of each model is shown in Table 8, which summarizes the superior profile of the proposed model. In summary, LGSF-Net represents a significant step forward in medical image classification, offering robust performance and computational efficiency. While further optimizations are necessary for real-world scalability, the proposed framework lays a solid foundation for future research in intelligent healthcare.

Table 8.

Comprehensive evaluation of each model.

Model  Local Sensing  Global Sensing  High Comput. Eff.  High Acc.  High Generalization Capability
InceptionV3  ✓  –  ✓  –  –
ResNet50  ✓  –  –  ✓  ✓
ViT  –  ✓  –  ✓  ✓
Swin  –  ✓  –  –  –
LGSF-Net  ✓  ✓  ✓  ✓  ✓

Limitation

As shown in Fig. 3, the dataset used contains fundus images of different forms and resolutions. However, the label categories of the dataset are balanced, which is admittedly not challenging enough: many experimental results can reach relatively ideal values. We encourage the use of LGSF-Net on imbalanced datasets with more categories.

Although LGSF-Net is already lightweight, the attention mechanism still incurs computational overhead of quadratic complexity, which limits further expansion of the model’s feature dimensions48.

Future directions

By changing the output head, LGSF-Net can accomplish other machine vision tasks that exhibit both global and local features simultaneously, such as image restoration49. It is also possible to increase the number of network layers and attention heads together for training on large datasets such as ImageNet, so that LGSF-Net serves as a pre-trained general backbone model23.

Conclusion

In this paper, aiming at the problem of automatic diagnosis and classification of fundus medical images and taking into account the complex characteristics of fundus diseases, the LGSF-Net model is proposed based on the concept of global and local information fusion. The proposed model was tested on publicly accessible datasets, and the results show that LGSF-Net achieves higher accuracy and computational efficiency than other models, with superior performance on the other medical indicators as well, making it more suitable for automatic medical diagnosis. At the same time, the ablation study presents the feature maps learned by each feature learning module, proves the effectiveness of the global and local information fusion approach, and explains the reasons for LGSF-Net’s advantages. Future work can apply the concept of global and local information fusion to other suitable fields.

Author contributions

Ankang Lin performed all the work for the paper, including the method, experiments, visualization, writing, and final revisions.

Data availability

The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/gunavenkatdoddi/eye-diseases-classification. Moreover, the data can also be obtained by contacting the author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Li, E. Y. et al. Prevalence of blindness and outcomes of cataract surgery in hainan province in south china. Ophthalmology120, 2176–2183. 10.1016/j.ophtha.2013.04.003 (2013). [DOI] [PubMed] [Google Scholar]
  • 2.Nath, T. et al. Prevalence of steroid-induced cataract and glaucoma in chronic obstructive pulmonary disease patients attending a tertiary care center in india. Asia-Pac. J. Ophthalmol.6, 28–32. 10.22608/APO.201616 (2017). [DOI] [PubMed] [Google Scholar]
  • 3.Tsiknakis, N. et al. Deep learning for diabetic retinopathy detection and classification based on fundus images: A review. Comput. Biol. Med.135, 104599. 10.1016/j.compbiomed.2021.104599 (2021). [DOI] [PubMed] [Google Scholar]
  • 4.Tham, Y.-C. et al. Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis. Ophthalmology121, 2081–2090. 10.1016/j.ophtha.2014.05.013 (2014). [DOI] [PubMed] [Google Scholar]
  • 5.Kumari, P. & Saxena, P. Cataract detection and visualization based on multi-scale deep features by rinet tuned with cyclic learning rate hyperparameter. Biomed. Signal Process. Control87, 105452. 10.1016/j.bspc.2023.105452 (2024). [Google Scholar]
  • 6.Islam, M. M. et al. Predicting the risk of diabetic retinopathy using explainable machine learning algorithms. Diabetes Metab. Syndr.: Clin. Res. Rev.17, 102919. 10.1016/j.dsx.2023.102919 (2023). [DOI] [PubMed] [Google Scholar]
  • 7.Huang, C., Sarabi, M. & Ragab, A. E. Mobilenet-v2 /ifho model for accurate detection of early-stage diabetic retinopathy. Heliyon10, e37293. 10.1016/j.heliyon.2024.e37293 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Baudouin, C., Kolko, M., Melik-Parsadaniantz, S. & Messmer, E. M. Inflammation in glaucoma: From the back to the front of the eye, and beyond. Prog. Retin. Eye Res.83, 100916. 10.1016/j.preteyeres.2020.100916 (2021). [DOI] [PubMed] [Google Scholar]
  • 9.Balaha, H. M., Hassan, A.E.-S., Ahmed, R. A. & Balaha, M. H. Advancing eye disease detection: A comprehensive study on computer-aided diagnosis with vision transformers and shap explainability techniques. Biocybern. Biomed. Eng.45, 23–33. 10.1016/j.bbe.2024.11.005 (2025). [Google Scholar]
  • 10.Albuquerque, C., Henriques, R. & Castelli, M. Deep learning-based object detection algorithms in medical imaging: Systematic review. Heliyon11, e41137. 10.1016/j.heliyon.2024.e41137 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Kim, B., Zhuang, Y., Mathai, T. S. & Summers, R. M. Otmorph: Unsupervised multi-domain abdominal medical image registration using neural optimal transport. IEEE Trans. Med. Imaging44, 165–179. 10.1109/TMI.2024.3437295 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Shi, J. et al. A survey of label-noise deep learning for medical image analysis. Med. Image Anal.95, 103166. 10.1016/j.media.2024.103166 (2024). [DOI] [PubMed] [Google Scholar]
  • 13.Gao, Y., Zhang, J., Wei, S. & Li, Z. Pformer: An efficient cnn-transformer hybrid network with content-driven p-attention for 3d medical image segmentation. Biomed. Signal Process. Control101, 107154. 10.1016/j.bspc.2024.107154 (2025). [Google Scholar]
  • 14.Oliveira, G. C. et al. Robust deep learning for eye fundus images: Bridging real and synthetic data for enhancing generalization. Biomed. Signal Process. Control94, 106263. 10.1016/j.bspc.2024.106263 (2024). [Google Scholar]
  • 15.Khan, S. U. R. et al. Optimized deep learning model for comprehensive medical image analysis across multiple modalities. Neurocomputing619, 129182. 10.1016/j.neucom.2024.129182 (2025). [Google Scholar]
  • 16.Bhati, A., Gour, N., Khanna, P. & Ojha, A. Discriminative kernel convolution network for multi-label ophthalmic disease detection on imbalanced fundus image dataset. Comput. Biol. Med.153, 106519. 10.1016/j.compbiomed.2022.106519 (2023). [DOI] [PubMed] [Google Scholar]
  • 17.Toğaçar, M. Detection of retinopathy disease using morphological gradient and segmentation approaches in fundus images. Comput. Methods Programs Biomed.214, 106579. 10.1016/j.cmpb.2021.106579 (2022). [DOI] [PubMed] [Google Scholar]
  • 18.Al-Fahdawi, S. et al. Fundus-deepnet: Multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf. Fusion102, 102059. 10.1016/j.inffus.2023.102059 (2024). [Google Scholar]
  • 19.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826 (2016).
  • 20.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
  • 21.Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • 22.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
  • 23.Zhu, L. et al. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024).
  • 24.Ranjbarzadeh, R. et al. Breast tumor localization and segmentation using machine learning techniques: Overview of datasets, findings, and methods. Comput. Biol. Med.152, 106443. 10.1016/j.compbiomed.2022.106443 (2023). [DOI] [PubMed] [Google Scholar]
  • 25.Vijayarajeswari, R., Parthasarathy, P., Vivekanandan, S. & Basha, A. A. Classification of mammogram for early detection of breast cancer using svm classifier and hough transform. Measurement146, 800–805. 10.1016/j.measurement.2019.05.083 (2019). [Google Scholar]
  • 26.Latif, J. et al. Enhanced nature inspired-support vector machine for glaucoma detection. Comput. Mater. Contin.76, 1151–1172. 10.32604/cmc.2023.040152 (2023). [Google Scholar]
  • 27.K, A. et al. Effect of multi filters in glucoma detection using random forest classifier. Meas.: Sens.25, 100566, 10.1016/j.measen.2022.100566 (2023).
  • 28.Hedberg-Buenz, A. et al. Quantitative measurement of retinal ganglion cell populations via histology-based random forest classification. Exp. Eye Res.146, 370–385. 10.1016/j.exer.2015.09.011 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Abraham, B. & Nair, M. S. Computer-aided diagnosis of clinically significant prostate cancer from mri images using sparse autoencoder and random forest classifier. Biocybern. Biomed. Eng.38, 733–744. 10.1016/j.bbe.2018.06.009 (2018). [Google Scholar]
  • 30.Riza Rizky, L. M. & Suyanto, S. Adversarial training and deep k-nearest neighbors improves adversarial defense of glaucoma severity detection. Heliyon8, e12275. 10.1016/j.heliyon.2022.e12275 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Cherif, W. Optimization of k-nn algorithm by clustering and reliability coefficients: application to breast-cancer diagnosis. Procedia Computer Science127, 293–299, 10.1016/j.procs.2018.01.125 (2018). Proceedings of the first international conference on intelligent computing in data sciences, ICDS2017.
  • 32.Jia, W. et al. Ankylosing spondylitis prediction using fuzzy k-nearest neighbor classifier assisted by modified jaya optimizer. Comput. Biol. Med.175, 108440. 10.1016/j.compbiomed.2024.108440 (2024). [DOI] [PubMed] [Google Scholar]
  • 33.Kommaraju, R. & Anbarasi, M. Diabetic retinopathy detection using convolutional neural network with residual blocks. Biomed. Signal Process. Control87, 105494. 10.1016/j.bspc.2023.105494 (2024). [Google Scholar]
  • 34.Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114 (PMLR, 2019).
  • 35.Ravi, V., Acharya, V. & Alazab, M. A multichannel efficientnet deep learning-based stacking ensemble approach for lung disease detection using chest x-ray images. Clust. Comput.26, 1181–1203 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Saeedi, S., Rezayi, S., Keshavarz, H. & R. Niakan Kalhori, S. Mri-based brain tumor detection using convolutional deep learning methods and chosen machine learning techniques. BMC Med. Inform. Decis. Mak.23, 16 (2023). [DOI] [PMC free article] [PubMed]
  • 37.Chaitanya, K., Erdil, E., Karani, N. & Konukoglu, E. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. Neural Inf. Process. Syst.33, 12546–12558 (2020). [Google Scholar]
  • 38.Heidari, M. et al. Computation-efficient era: A comprehensive survey of state space models in medical image analysis. arXiv preprint arXiv:2406.03430 (2024).
  • 39.Cui, Y. & Knoll, A. Enhancing local–global representation learning for image restoration. IEEE Trans. Ind. Inform.20, 6522–6530. 10.1109/TII.2023.3345464 (2024). [Google Scholar]
  • 40.Wang, T. et al. Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions. Int. J. Comput. Vis.132, 4541–4563. 10.1007/s11263-024-02056-0 (2024). [Google Scholar]
  • 41.Song, J. et al. Global and local feature reconstruction for medical image segmentation. IEEE Trans. Med. Imaging41, 2273–2284. 10.1109/TMI.2022.3162111 (2022). [DOI] [PubMed] [Google Scholar]
  • 42.Dong, A., Liu, J., Lv, G. & Cheng, J. Glmr-net: Global-to-local mutually reinforcing network for pneumonia segmentation and classification. Pattern Recognit.162, 111371. 10.1016/j.patcog.2025.111371 (2025). [Google Scholar]
  • 43.Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
  • 44.Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. (2017).
  • 45.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2017).
  • 46.Xu, W., Fu, Y.-L. & Zhu, D. Resnet and its application to medical image processing: Research progress and challenges. Comput. Methods Programs Biomed.240, 107660. 10.1016/j.cmpb.2023.107660 (2023). [DOI] [PubMed] [Google Scholar]
  • 47.Pan, C., Chen, J. & Huang, R. Medical image detection and classification of renal incidentalomas based on yolov4+asff swin transformer. J. Radiat. Res. Appl. Sci.17, 100845. 10.1016/j.jrras.2024.100845 (2024). [Google Scholar]
  • 48.Dai, K. et al. Dsap: Dynamic sparse attention perception matcher for accurate local feature matching. IEEE Trans. Instrum. Meas.73, 1–16. 10.1109/TIM.2024.3370781 (2024). [Google Scholar]
  • 49.Wu, G., Jiang, J., Jiang, K., Liu, X. & Nie, L. Learning dynamic prompts for all-in-one image restoration. IEEE Trans. Image Process.34, 3997–4010. 10.1109/TIP.2025.3567205 (2025). [DOI] [PubMed] [Google Scholar]
