Abstract
Retinitis pigmentosa (RP) and Stargardt disease (STGD) are inherited retinal diseases that can seriously affect vision. In this study, we present a novel, two-phase self-supervised learning method that addresses the challenge of limited labeled data in medical image analysis. In the first phase, the model learns useful visual features from a large collection of unlabeled retinal images using self-supervised training. In the second phase, it is fine-tuned on a smaller set of labeled images for the classification of RP and STGD. Experiments using 5844 unlabeled and 782 labeled fundus images show that our method, based on the EfficientNet-B1 architecture, outperforms state-of-the-art supervised learning methods, achieving 98.15% accuracy and 99.68% AUC. The proposed method is flexible and scalable, making it well-suited for real-world applications where labeled data is scarce.
Subject terms: Computational biology and bioinformatics, Diseases, Mathematics and computing, Medical research
Introduction
Inherited retinal diseases (IRDs), especially retinitis pigmentosa (RP) and Stargardt disease (STGD), seriously affect vision. These diseases cause structural changes in the retina that can be detected through advanced imaging methods. Various retinal images, including color fundus photographs (CFP), infrared (IR) images, and optical coherence tomography (OCT), are used to investigate eye diseases such as RP, diabetic retinopathy, STGD, and glaucoma1.
Despite advances in medical imaging and image processing, accurate diagnosis of RP and STGD remains a research topic in this field. Non-automated methods often rely on ophthalmologists’ assessments, which can be affected by factors such as fatigue and experience; diagnosis is further complicated by differences in disease progression and the inherent variability of retinal images2,3. As shown in the study of Esengönül et al.1, artificial intelligence (AI), especially deep learning methods, has been able to achieve good accuracy in diagnosing diseases such as RP. However, these methods face several problems due to the lack of labeled datasets.
Obtaining enough labeled images to train AI models remains a persistent challenge in medical image processing. Many studies, including those by Davidson et al.4 and Miere et al.5, have shown that the performance of AI models is strongly dependent on the quantity and quality of training data. A review study by Esengönül et al.1 showed that most studies use small datasets, typically fewer than 2000 images, which can hamper the generalizability and performance of AI models. This limitation emphasizes the need for innovative strategies to enhance the training process.
Self-supervised learning methods address the shortage of labeled data by exploiting unlabeled data to improve model performance. In this research, we adopt a self-supervised learning framework to address this challenge. Our proposed method works with CFP and IR images and uses the EfficientNet-B1 architecture as the backbone for network training. It consists of two phases. In Phase 1, the network is trained on a large dataset of unlabeled images with a contrastive loss applied to the original image and its augmented view; because the two eyes of the same person are visually similar, the fellow-eye image is also used as a positive counterpart of the original image. In Phase 2, the network is fine-tuned on accurately labeled data to learn the final goal of the network, which is the diagnosis of RP and STGD.
For this study, we collected 5844 unlabeled images to train the initial model in the first phase and turn it into a suitable feature extractor. In the second phase, the model was trained using 782 labeled images for RP and STGD classification. In general, the innovations of this work are the following:
Improved accuracy in RP and STGD detection by using a self-supervised learning method instead of fully supervised methods.
Introducing a novel data augmentation method using paired left and right eye images as positive pairs during pre-training.
Providing a dataset of 5,844 fundus images for effective self-supervised pre-training in IRD detection.
Related works
In this section, we review research on RP and STGD detection and on image classification, including relevant neural network architectures and deep learning methods.
RP and STGD detection
Recent research has demonstrated the effectiveness of image processing and AI methods in improving the diagnosis of IRDs. Davidson et al.4 used a Multidimensional Recurrent Neural Network (MDRNN) to detect cones in Adaptive Optics Scanning Light Ophthalmoscope (AOSLO) images for STGD, achieving a Dice score of 0.9431. Fujinami et al.2 utilized a deep learning architecture to achieve 90.9% accuracy in predicting specific genetic markers for macular dystrophy and RP patients using 178 SD-OCT images. Similarly, Charng et al.6 applied a ResNet-UNet architecture for the segmentation of hyperautofluorescent flecks in fundus autofluorescence (FAF) images of STGD and attained a Dice score of 0.71. Iadanza et al.7 used a Support Vector Machine (SVM) to evaluate chromatic pupillometry data for early RP diagnosis and obtained an accuracy of 84.6%. Miere et al.8 used a ResNet101-based CNN model for classifying various IRDs such as RP and STGD from FAF images, reaching an accuracy of 95%. Shah et al.3 reported 99.6% accuracy for OCT image classification of STGD patients. Miere et al.5 used a pre-trained ResNet101 model to classify STGD from FAF images and obtained an accuracy of 87.30% with an AUC-ROC of 0.981. Lastly, Jafarbeglou et al.9 evaluated a multi-input MobileNetV2 architecture combining CFP and IR images for the diagnosis of RP and STGD; using a dataset of 391 cases, the model achieved 94.44% accuracy with each modality alone and 96.3% accuracy when both were used together, highlighting the benefit of multi-input deep learning for IRD detection. Together, this body of research demonstrates that deep learning can enhance diagnostic precision and process efficiency for IRDs.
More recently, several works have investigated advanced learning frameworks for the automated detection and staging of RP. Guven et al.10 proposed a GAN-based augmentation and transfer learning strategy using P1 wave amplitude maps from mfERG data, achieving 94.9% accuracy across four RP stages. Karaman et al.11 improved RP staging by integrating handcrafted features extracted from visual field grayscale and mfERG amplitude maps, where an SVM classifier reached 98.39% accuracy. Ferreira et al.12 further explored deep learning using FAF images to classify syndromic and non-syndromic RP across 66 genes, demonstrating the potential of AI-assisted imaging for genotype-phenotype correlation in RP diagnosis.
Image classification
Various neural network architectures have been introduced for image classification tasks. One of the first architectures introduced in this field is the AlexNet architecture. AlexNet, proposed by Krizhevsky et al.13 in 2012, made a leap in neural networks and computer vision. It included five convolutional layers and three fully connected layers, using Rectified Linear Units (ReLU) as an activation function. AlexNet introduced overlapping max-pooling layers and dropout layers to prevent overfitting. VGG16, introduced by Simonyan et al.14, improved image classification by using a deep architecture of 16 layers: 13 convolutional and three fully connected. By implementing 3x3 convolution filters across the network, VGG16 allowed the creation of deeper architectures with more parameters. The network design enhanced its feature extraction capabilities which resulted in superior ImageNet challenge performance.
ResNet was introduced in 2015 by He et al.15 to improve deep learning by solving the degradation problem in deep networks. ResNet implemented residual learning with skip connections to train networks’ identity mappings, enabling them to bypass certain layers. ResNet uses skip connections to address the vanishing gradient issue. InceptionV3 introduced by Szegedy et al.16 presents an inception architecture developed to improve computational efficiency while preserving top performance levels. The Inception V3 module employs multiple convolutional filter sizes to enable multi-scale feature extraction in a single layer and uses factorized convolutions to split bigger convolution filters into smaller components which decrease both parameter count and computational expenses.
MobileNetV1 was proposed by Howard et al.17. In 2017, MobileNetV1 appeared with a design focus on mobile and embedded vision applications. Depth-wise separable convolutions form the basis of the architecture by splitting standard convolutions into two stages of depth-wise and pointwise convolutions to achieve lower parameter usage and computational expense. MobileNetV2 was proposed by Sandler et al.18. This novel architecture preserves MobileNetV1’s depth-wise separable convolutions efficiency yet brings several significant improvements. The architecture features residual blocks that employ narrow bottleneck layers that expand inversely during intermediate processing stages. The architectural improvements enhance feature extraction and improve gradient flow.
Tan et al.19 proposed EfficientNet, which optimizes network depth, width, and resolution simultaneously using compound scaling, balancing complexity and performance efficiently. It employs depth-wise separable convolutions to reduce parameters and computational demands and the Swish activation function for smoother gradients and better optimization. The architecture features mobile inverted bottleneck convolution (MBConv) blocks with squeeze-and-excitation optimization, enhancing representational capacity.
Learning methods
Neural network architectures can be trained using multiple methods, including supervised learning20, unsupervised learning21, and self-supervised learning (SSL)22,23. These methods primarily differ in whether they require labeled datasets. Any technique that reduces the need for labeled data while maintaining high accuracy offers significant advantages, since labeling data is difficult and costly.
One such method is self-supervised learning, a deep learning technique that enables model training without relying on labeled data. In this paradigm, the neural network backbone is first trained with unlabeled data to develop feature extraction capabilities and is then fine-tuned for the specific downstream task. Self-supervised learning combines the strengths of supervised and unsupervised learning into a unified learning method: it scales effectively, is cost-efficient, and generalizes well.
Method
The proposed method uses a two-phase learning method to improve retinal image classification, as depicted in Fig. 1. This method improves the model’s ability to accurately classify different retinal diseases through a training process divided into self-supervised pre-training and fine-tuning.
Figure 1.
In the first phase, training is performed on the unlabeled dataset using a contrastive loss; in the second phase, the network is trained on a labeled dataset containing RP and STGD patients and healthy individuals.
The self-supervised pre-training phase of the training process is divided into two separate parts.
(1) The first part of phase one involves self-supervised pre-training with positive and negative retinal image pairs. A positive pair contains photographs of the same person: an image of one eye (or its augmented version) alongside an image of the other eye from that person. Negative pairs consist of eye images from different people chosen at random. This procedure enables visual feature representation learning through feature extraction and comparison without requiring labeled data. This part uses the contrastive loss in Eq. (1), which directs the model to reduce distances between positive pairs while expanding distances between negative pairs.
(2) During the second part of the first phase, the deep learning model learns from both left-eye and right-eye images. The primary objective is to train the model to differentiate between left and right eyes, which encourages it to learn the features that distinguish them. Through this stage the model progressively learns detailed visual features, which is essential for acquiring the fundamental visual distinctions needed in the second phase.
$$\mathcal{L}_{\text{contrastive}} = y\,D^{2} + (1 - y)\,\max(m - D,\, 0)^{2} \tag{1}$$

where $D$ is the Euclidean distance between the feature representations of the paired images, $y$ indicates whether the pair is positive (1) or negative (0), and $m$ is a margin that defines how far apart negative pairs should be.
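To make the pre-training objective concrete, the following is a minimal PyTorch-style sketch of Eq. (1). It is not the released implementation; the margin value and the commented batch construction (fellow-eye positives, randomly drawn negatives) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Eq. (1): label = 1 for positive pairs (two eyes of the same person or an
    augmented view), label = 0 for randomly chosen negative pairs."""
    dist = F.pairwise_distance(emb_a, emb_b)                                # Euclidean distance D
    positive_term = label * dist.pow(2)                                     # pull positive pairs together
    negative_term = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)  # push negatives beyond margin m
    return (positive_term + negative_term).mean()


# Hypothetical usage: `backbone` maps a fundus image batch to embeddings,
# `img_left`/`img_right` come from the same person (positive pair) and
# `img_other` from a different, randomly chosen person (negative pair).
# loss = contrastive_loss(backbone(img_left), backbone(img_right), torch.ones(batch_size)) \
#      + contrastive_loss(backbone(img_left), backbone(img_other), torch.zeros(batch_size))
```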
$$\mathcal{L}_{\text{CE}} = -\sum_{i=1}^{C} y_{i}\,\log(\hat{y}_{i}) \tag{2}$$

where $y_i$ represents the true labels, $\hat{y}_i$ denotes the predicted probabilities for each class, and $C$ is the number of classes.
The model trained in the first phase is used in the second phase, where it is trained on a labeled dataset. Three image categories, RP, STGD, and Healthy, are used as input data, and two FC layers are added so the network acquires the ability to classify input images. The input data in this phase are CFP images. After completing this phase, the network can classify images and diagnose RP and STGD. The network used in this method is EfficientNet-B1, and the cross-entropy24 loss function, given in Eq. (2), is used to train the network in the second phase.
Overall, in the proposed method, the network is first trained on an unlabeled dataset to create a suitable feature extractor that can understand the images, and in the second phase it is trained with added classification layers on the target dataset for classifying RP, STGD, and Healthy images.
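As a rough illustration of this second phase, the sketch below attaches two fully connected layers to a torchvision EfficientNet-B1 backbone and trains them with the cross-entropy loss of Eq. (2). The hidden size of 256 and the weight-loading step are assumptions for illustration, not values reported above.

```python
import torch.nn as nn
from torchvision.models import efficientnet_b1


class FineTuneClassifier(nn.Module):
    """Phase-2 model: self-supervised EfficientNet-B1 backbone plus two added FC layers."""

    def __init__(self, backbone, num_classes=3):
        super().__init__()
        self.backbone = backbone                       # feature extractor from the self-supervised phase
        self.head = nn.Sequential(
            nn.Linear(1280, 256),                      # 1280 = EfficientNet-B1 pooled feature size; 256 is assumed
            nn.ReLU(),
            nn.Linear(256, num_classes),               # logits for RP / STGD / Healthy
        )

    def forward(self, x):
        return self.head(self.backbone(x))


backbone = efficientnet_b1(weights=None)
backbone.classifier = nn.Identity()                    # expose the 1280-dimensional pooled features
# ... load the phase-1 self-supervised weights into `backbone` here ...
model = FineTuneClassifier(backbone)
criterion = nn.CrossEntropyLoss()                      # cross-entropy loss of Eq. (2)
# loss = criterion(model(images), labels)              # `images`, `labels` come from the labeled CFP set
```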
Dataset
The study dataset includes 5,844 unlabeled images with annotations that indicate only whether each image shows the left eye or the right eye; it lacks annotations about the health status of subjects or the presence of retinal diseases. Alongside this, there are 782 labeled images with specific disease classifications: 316 RP images, 124 STGD images, and 342 images from normal subjects. The dataset is partitioned into five distinct folds, which facilitates a thorough evaluation of the model’s performance on various data segments. Furthermore, during the training process, the dataset was randomly split into 80% training and 20% testing sets. This approach allows for a balanced training regimen and sufficient evaluation data to assess model generalization capabilities.
The demographic characteristics of our labeled dataset, sourced from9, are presented in Table 1. The labeled dataset comprises 782 labeled fundus images with balanced age and gender distributions across all diagnostic categories. Statistical analysis revealed no significant differences in age distribution (RP: 40 ± 13 years, STGD: 30 ± 12 years, Healthy: 38 ± 12 years; p = 0.41) or gender distribution (Male: 47%, 44%, and 43% for the RP, STGD, and Healthy groups, respectively; p = 0.14), indicating an unbiased dataset composition suitable for robust model training and evaluation. All 5,844 unlabeled fundus images used for self-supervised pre-training were obtained from the IRDReg® registry under the same ethical approval code (IR.SBMU.ORC.REC.1396.15) as the labeled dataset.
Table 1.
Age and gender distribution of the dataset.
| Factors | Level | RP | STGD | Healthy eyes | P-value |
|---|---|---|---|---|---|
| Age (years) | Mean | 40 ± 13 | 30 ± 12 | 38 ± 12 | 0.41 |
| | Median (min, max) | 40 (7, 74) | 31 (6, 63) | 34 (6, 72) | |
| Gender (%) | Male | 74 (47%) | 27 (44%) | 73 (43%) | 0.14 |
| | Female | 84 (53%) | 35 (56%) | 98 (57%) | |
Figure 2 illustrates representative samples from the three diagnostic categories used in this study, with six examples from each class (Normal, RP, and STGD) demonstrating the intra-class variability in color contrast, vessel clarity, and macular appearance. These variations highlight the heterogeneity of the dataset, which poses a realistic challenge for robust model generalization and motivates the use of self-supervised feature learning prior to supervised fine-tuning.
Figure 2.
Representative fundus image samples from each class within the experimental dataset. Top row: Normal controls; middle: RP cases; bottom: STGD cases. Each class contains six representative samples illustrating the visual diversity within the same diagnostic category.
Experiments
In this section, we describe the experimental setup and results that demonstrate the effectiveness of the proposed method for diagnosing RP and STGD.
Experimental setting
Implementation details
The implementation phase applies multiple data augmentation techniques to produce a more diverse training dataset and improve the model’s robustness. Data augmentation for image processing includes operations like flipping (up-down and left-right), scaling (zooming in and out), translating (shifting the image), and HSV color space adjustments.
The training process applies Stochastic Gradient Descent (SGD)25 with 0.9 momentum to enhance convergence speed and stabilize training. The preprocessing stage standardizes network inputs by resizing and cropping images to 224x224 pixels.
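A possible torchvision realization of this augmentation and preprocessing pipeline is sketched below; the probabilities and parameter ranges are illustrative assumptions rather than the exact values used in our experiments.

```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                   # left-right flip
    transforms.RandomVerticalFlip(p=0.5),                     # up-down flip
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1),             # shifting the image
                            scale=(0.9, 1.1)),                # zooming in and out
    transforms.ColorJitter(brightness=0.2, saturation=0.2,
                           hue=0.05),                          # HSV-style color adjustment
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),       # resize/crop to 224x224 pixels
    transforms.ToTensor(),
])
```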
To ensure robust evaluation of the proposed method, a 5-fold cross-validation strategy was used. The dataset was divided into 5 equal folds, and the model was trained and evaluated 5 times, with each fold serving as the validation set once and the remaining 4 folds used for training. The reported metrics, including accuracy, precision, recall, F1-score, specificity, and AUC, represent the mean values across the 5 folds, and the standard deviation reflects the variability of the results across these folds.
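The cross-validation procedure can be summarized by the following schematic sketch; `build_model`, `train_one_fold`, and `evaluate` are hypothetical placeholders, and the learning rate is an assumption, since only the momentum value is specified above.

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []

for train_idx, test_idx in skf.split(np.zeros(len(labels)), labels):        # `labels`: class label per image
    model = build_model()                                                    # hypothetical: phase-1 backbone + FC head
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # momentum 0.9 as above; lr assumed
    train_one_fold(model, optimizer, train_idx)                              # hypothetical training routine
    fold_accuracies.append(evaluate(model, test_idx))                        # hypothetical per-fold evaluation

print(f"accuracy: {np.mean(fold_accuracies):.2f} +/- {np.std(fold_accuracies):.2f}")
```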
Evaluation metrics
A comprehensive performance evaluation of the model requires multiple evaluation metrics. Accuracy is the proportion of true positives and true negatives among all examined cases. High accuracy shows that the model can accurately differentiate between healthy and diseased images, which helps precisely detect conditions such as STGD and RP. The Accuracy formula is given in Eq. (3), where TP denotes True Positive, TN True Negative, FP False Positive, and FN False Negative.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
Precision shows how many true positive results are present in relation to all positive predictions made by the model. Medical diagnostic processes must maintain high precision to reduce false positives, which prevents healthy patients from being wrongly diagnosed with STGD or RP. The formula for calculating precision is defined in Eq. (4).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
Recall (Eq. (5)) calculates how many actual positive cases the model successfully identified. Detecting conditions such as RP and STGD requires high recall rates to ensure that most patients with these diseases receive proper identification and reduce the chances of missed cases.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$
The F1 score is the harmonic mean of precision and recall, combining these metrics into a single balanced measure. A high F1 score indicates that the model detects RP and STGD patients while minimizing false positives, which makes it an essential metric in clinical applications. The F1 formula is given in Eq. (6).
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
The metric of specificity (Eq. 7) calculates how many true negative results exist within all negative outcomes. The accurate identification of healthy people through high specificity is essential because it prevents unnecessary anxiety and further testing for individuals who do not have RP or STGD.
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{7}$$
Negative Predictive Value (NPV) represents the fraction of true negative results out of all negative test results. In healthcare environments, a strong NPV ensures patient trust by confirming that negative test outcomes for RP and STGD accurately indicate the absence of disease. The NPV formula is given in Eq. (8).
$$\text{NPV} = \frac{TN}{TN + FN} \tag{8}$$
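For reference, the sketch below computes Eqs. (3)-(8) from one-vs-rest confusion-matrix counts; the example counts in the trailing comment are read off the RP column of Fig. 3 and are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (3)-(8) computed from one-vs-rest confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)              # Eq. (3)
    precision = tp / (tp + fp)                              # Eq. (4)
    recall = tp / (tp + fn)                                 # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)      # Eq. (6)
    specificity = tn / (tn + fp)                            # Eq. (7)
    npv = tn / (tn + fn)                                    # Eq. (8)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "specificity": specificity, "npv": npv}


# Illustrative example using the RP counts visible in Fig. 3
# (64/66 RP images correct, 1 STGD image predicted as RP):
# classification_metrics(tp=64, tn=95, fp=1, fn=2)
```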
Experimental results
Overall classification performance
Table 2 demonstrates that our self-supervised learning method surpasses the supervised method on all evaluation metrics and across various neural network architectures. When applied to the EfficientNet-B1 architecture, self-supervised learning produces notable performance gains: accuracy advances from 93.83% to 98.15%, balanced accuracy (BA) climbs from 93.09% to 98.11%, and AUC moves from 98.39% to 99.68%. These results indicate strong feature extraction and sustained performance on complex and imbalanced datasets.
Table 2.
Comparison of supervised and self-supervised learning performance across multiple model architectures for RP and STGD classification.
| Model | Method | Accuracy | BA | Precision | F1 | Specificity | NPV | AUC |
|---|---|---|---|---|---|---|---|---|
| MnasNet-A128 | Supervised | 86.42 ± 1.56 | 80.7 | 88.53 | 84.23 | 91.3 | 96.47 | 93.84 |
| | Our Self-Supervised | 91.36 ± 0.94 | 89.58 | 91.2 | 91.15 | 95.24 | 96.2 | 97.12 |
| AlexNet13 | Supervised | 87.04 ± 1.86 | 83.37 | 87.13 | 86.18 | 92.35 | 95.06 | 94.06 |
| | Our Self-Supervised | 93.21 ± 1.86 | 91.1 | 93.23 | 92.96 | 96.04 | 97.5 | 99.09 |
| ShuffleNetV2-1x29 | Supervised | 90.12 ± 1.24 | 88.97 | 90.12 | 90.12 | 95.26 | 95.26 | 98.22 |
| | Our Self-Supervised | 95.68 ± 0.62 | 96.09 | 96.06 | 95.73 | 98.44 | 97.38 | 99.12 |
| Inception-V316 | Supervised | 92.59 ± 1.24 | 90.22 | 92.72 | 92.36 | 95.54 | 97.32 | 98.39 |
| | Our Self-Supervised | 96.91 ± 0.62 | 96.36 | 96.91 | 96.9 | 98.35 | 98.58 | 99.47 |
| ResNet5015 | Supervised | 93.21 ± 0.94 | 93.33 | 93.75 | 93.27 | 93.37 | 96.24 | 98.39 |
| | Our Self-Supervised | 96.91 ± 0.94 | 95.61 | 97.13 | 96.86 | 97.88 | 99.09 | 99.79 |
| VGG1614 | Supervised | 94.44 ± 0.94 | 92.85 | 94.56 | 94.35 | 96.65 | 97.83 | 98.83 |
| | Our Self-Supervised | 97.53 ± 0.34 | 97.61 | 97.6 | 97.54 | 99.01 | 98.56 | 99.41 |
| DenseNet12130 | Supervised | 93.83 ± 0.62 | 92.72 | 93.76 | 93.74 | 96.7 | 97.49 | 97.49 |
| | Our Self-Supervised | 97.53 ± 0.62 | 96.49 | 97.67 | 97.5 | 98.3 | 99.27 | 98.74 |
| MobileNetV218 | Supervised | 94.44 ± 0.94 | 93.15 | 94.38 | 94.37 | 97.05 | 97.68 | 98.83 |
| | Our Self-Supervised | 98.15 ± 0.34 | 98.11 | 98.17 | 98.15 | 99.2 | 98.97 | 99.71 |
| EfficientNet-B119 | Supervised | 93.83 ± 0.34 | 93.09 | 93.76 | 93.78 | 96.86 | 96.96 | 98.39 |
| | Our Self-Supervised | 98.15 ± 0.62 | 98.11 | 98.17 | 98.15 | 99.2 | 98.97 | 99.68 |
The supervised methods were trained using 782 labeled images, whereas the proposed self-supervised framework utilized 5844 unlabeled and 782 labeled images. Metrics include Accuracy, BA, Precision, F1-score, Specificity, NPV, and AUC. The self-supervised method shows significant performance improvements across all architectures, highlighting its robustness and adaptability.
The self-supervised learning method shows substantial performance gains across multiple architectures, which demonstrates its robustness. MobileNetV2’s accuracy jumps from 94.44% to 98.15%, while DenseNet121’s precision increases from 93.76% to 97.67%. The self-supervised framework also delivers significant improvements in balanced accuracy and specificity for both ResNet50 and InceptionV3, demonstrating consistent superiority in medical image classification.
EfficientNet-B1 shows the best performance in terms of accuracy and AUC, yet, as Table 2 shows, all tested models demonstrate improved results with the self-supervised method compared to traditional supervised learning. These results show that the proposed method can scale and adapt to various situations, making it effective for tackling diagnostic difficulties in retinal disorders (RP and STGD).
Our self-supervised learning method demonstrates remarkable superiority in RP and STGD classification compared to the supervised method, as illustrated in Fig. 4. The ROC curve comparison shows that our self-supervised model achieves perfect classification for Normal cases (AUC = 1.0000) and exceptional performance for both RP (AUC = 0.9961) and Stargardt disease (AUC = 0.9945), consistently outperforming the supervised method (AUC = 0.9998, 0.9766, and 0.9730, respectively). This advantage is further evidenced in Fig. 5, where the macro-average AUC of 0.9968 highlights our method’s robust diagnostic capability across all pathological categories, establishing a new benchmark in retinal disease classification performance.
Figure 4.
Receiver operating characteristic (ROC) curves comparing the self-supervised and supervised learning methods across three diagnostic categories (Normal, RP, and Stargardt disease). The self-supervised method demonstrates superior performance with AUC values of 1.0000 (Normal), 0.9961 (RP), and 0.9945 (STGD), outperforming the supervised method across all disease categories.
Figure 5.

Bar chart illustrating the Area Under the ROC Curve (AUC) scores for both classification methods. The Self-supervised method achieves higher AUC values for all three diagnostic categories and demonstrates superior overall performance with a macro-average AUC of 0.9968 compared to 0.9831 for the Supervised method.
The confusion matrix in Fig. 3 demonstrates the classification performance of our proposed method on the test dataset. The model achieved perfect classification for Normal cases (100% accuracy, 58/58), excellent performance for RP detection (96.97% accuracy, 64/66 with only 2 misclassifications as Stargardt), and strong performance for Stargardt classification (97.37% accuracy, 37/38 with 1 misclassification as RP). Notably, there were no false positives for Normal cases, indicating the model’s reliability in distinguishing pathological from healthy retinal conditions. The primary confusion occurred between RP and Stargardt diseases (3 total misclassifications), which is clinically understandable given potential similarities in fundus appearance between these two retinal dystrophies.
Figure 3.

Confusion matrix showing the classification performance of our proposed method on the test dataset. The matrix displays both absolute counts and percentages for each class prediction, demonstrating excellent discrimination between Normal, RP, and Stargardt cases.
The precision-recall analysis in Fig. 6 further validates our self-supervised method’s superior discriminative power, achieving an impressive macro-average AP of 0.9917 with perfect precision for Normal cases (AP = 1.0000) and outstanding scores for both RP (AP = 0.9951) and Stargardt disease (AP = 0.9799).
Figure 6.

Precision-Recall curves for the Self-supervised method across Normal, RP, and Stargardt disease categories. The exceptional Average Precision (AP) scores of 1.0000 (Normal), 0.9951 (RP), and 0.9799 (STGD) with a macro-average AP of 0.9917 demonstrate the method’s robust ability to maintain high precision even at increased recall thresholds.
Visualization and interpretability analysis
As illustrated in Fig. 7, HiResCAM visualizations (a Grad-CAM variant) reveal the model’s decision-making process for RP, Stargardt disease, and normal retinal images with high clarity. Diagnostic regions pertinent to each condition become clearly visible through the color transition from blue to red, which represents minimal to maximal model contribution. Mid-peripheral pigmentary deposits and attenuated vessels show intense activation in RP patients, macular flecks and atrophic changes are highlighted in Stargardt cases, and healthy subjects exhibit an even activation distribution across normal eye structures. Our model demonstrates exceptional classification accuracy while also localizing anatomical features that match ophthalmological diagnostics. The visualization verifies that our method identifies distinct spatial patterns for different pathologies, which improves the model’s transparency and clinical usefulness by providing visual evidence supporting its diagnostic reasoning.
Figure 7.
HiResCAM31 visualization results for RP, STGD, and normal retinal images, where red regions indicate areas of high importance for model classification.
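A minimal sketch of how such maps can be produced is shown below, assuming the third-party pytorch-grad-cam package (which includes a HiResCAM implementation); `model`, `images`, and the choice of target layer are hypothetical names for illustration, not taken from our released code.

```python
from pytorch_grad_cam import HiResCAM

# `model` is the trained classifier and `images` a batch of preprocessed fundus
# images (hypothetical names); the last convolutional block of the EfficientNet
# backbone is assumed as the target layer.
target_layers = [model.backbone.features[-1]]
cam = HiResCAM(model=model, target_layers=target_layers)
heatmaps = cam(input_tensor=images)          # one grayscale importance map per image
# Overlaying `heatmaps` on the original images yields the red high-importance
# regions shown in Fig. 7.
```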
Comparison with baseline and state-of-the-art methods
Table 3 provides a comprehensive comparison of state-of-the-art self-supervised learning methods (SimCLR, MoCo v2, BYOL, and SimSiam) for the classification of RP and STGD from fundus images. Our proposed method outperforms all compared self-supervised methods, delivering the highest scores on every metric, including accuracy (98.15%), BA (98.11%), precision (98.17%), F1-score (98.15%), specificity (99.2%), NPV (98.97%), and AUC (99.68%). These results suggest that our method is particularly effective at distinguishing subtle pathological patterns associated with RP and STGD while maintaining robust generalization and minimizing false negatives. The consistent gains across all key metrics demonstrate not only the technical merits of our methodology but also its practical potential as a robust tool for automated retinal disease screening, offering a new benchmark for self-supervised learning in ophthalmic diagnostics.
Table 3.
Comparison of performance metrics (accuracy, BA, precision, F1-score, specificity, NPV, and AUC) for various self-supervised learning methods trained on 5844 unlabeled and 782 labeled images.
| Method | Accuracy | BA | Precision | F1 | Specificity | NPV | AUC |
|---|---|---|---|---|---|---|---|
| Transfer Learning (Image Net) | 93.83 | 93.09 | 93.76 | 93.78 | 96.86 | 96.96 | 98.39 |
| SimCLR22 | 96.3 | 95.85 | 96.3 | 96.3 | 98.16 | 98.16 | 99.04 |
| MoCo v232 | 96.3 | 96.53 | 96.3 | 96.35 | 98.63 | 97.82 | 99.05 |
| BYOL33 | 94.44 | 92.85 | 94.44 | 94.31 | 96.81 | 97.85 | 97.48 |
| SimSiam34 | 97.53 | 96.86 | 97.53 | 97.52 | 98.54 | 99.01 | 99.19 |
| Our Method (Contrastive Loss, w/o left/right) | 96.91 | 95.61 | 97.88 | 96.86 | 97.13 | 99.09 | 99.62 |
| Our Method | 98.15 | 98.11 | 98.17 | 98.15 | 99.2 | 98.97 | 99.68 |
The proposed method consistently demonstrates superior results in all key metrics, setting a new benchmark for RP and STGD classification.
Discussion
Cross-study comparison
Table 4 summarizes previous deep learning approaches for RP and STGD classification, highlighting differences in imaging modalities, model architectures, and reported metrics. Ubukata et al. employed an InceptionV3 model utilizing CFP images for RP detection and achieved an accuracy of 96.97%. Sun et al. used an EfficientNet-B7 architecture applied to ultra-widefield images, reporting 93.00% accuracy. Studies by Fujinami et al. and Chen et al. explored additional imaging types and models, yielding accuracy values between 81.3% and 96.14%.
Table 4.
Comparison of deep learning methods for retinal disease classification, highlighting different image modalities, model architectures, and accuracy metrics across studies focusing on RP and STGD detection.
| Study | Image type | Disease | Model | Accuracy |
|---|---|---|---|---|
| Ubukata et al.35 | CFP (321 Samples) | RP | InceptionV3 | 96.97% |
| Sun et al.36 | Ultra-widefield image (4574 Samples) | RP | EfficientNet-B7 | 93.00% |
| Fujinami et al.37 | CFP and FAF (417 Samples) | STGD, RP, occult macular dystrophy | InceptionV3 | CFP: 88.2%, FAF: 81.3% |
| Chen et al.38 | CFP (1670 Samples) | RP | Xception | 96.14% |
| Guo et al.39 | CFP (250 Samples) | Glaucoma, Maculopathy, Pathological Myopia, RP | MobileNetV2 | 96.2% |
| Jafarbeglou et al.9 | CFP and IR (782 Samples) | RP, Stargardt | Multi-Input MobileNetV2 | 96.3% |
| Our Method | CFP (782 labeled images, 5844 unlabeled images) | RP, STGD | Self-supervised (EfficientNet-B1) | 98.15% |
While these investigations advanced the application of deep learning to inherited retinal disease classification, they universally relied on modestly sized labeled datasets and fully supervised paradigms, which constrain model generalization. In contrast, the proposed self-supervised framework integrates 5,844 unlabeled and 782 labeled fundus images for sequential training. This hybrid design enables efficient utilization of both annotated and unannotated data, resulting in robust feature learning and improved discriminative ability. As a result, our model achieves 98.15% accuracy and 99.68% AUC in RP and STGD detection, outperforming all previously reported methods in both accuracy and data efficiency.
Analytical discussion
The strong performance of the proposed framework is mainly due to its use of biologically informed training, which differs from conventional self-supervised approaches. Previous supervised and self-supervised studies achieved moderate accuracy when trained with limited labeled data and often failed to generalize to new retinal patterns. Common self-supervised techniques such as SimCLR, BYOL, and MoCo v2 improved data efficiency but learned task-independent features because they were developed for general vision datasets and lack an understanding of retinal anatomy. These methods rely on random cropping or color jittering, which makes the model focus on superficial texture rather than pathological structures.
Our framework introduces a domain-specific augmentation strategy based on the natural pairing of left and right eyes. In the self-supervised phase, images from both eyes of the same person are treated as positive pairs, guiding the network to align features across symmetric biological structures instead of relying only on random augmentations. This acts as a biologically meaningful augmentation that helps the model learn consistent retinal representations and distinguish true disease patterns from irrelevant variations.
The second advantage comes from the two-phase training procedure. The model first learns generic retinal representations from 5844 unlabeled fundus images and is then fine-tuned on 782 labeled samples. This gradual learning minimizes overfitting and improves stability with small labeled datasets.
In addition, the use of the EfficientNet-B1 backbone provides a balanced architecture that captures both global retinal topology and local lesion details related to RP and STGD with high fidelity and reasonable computational cost.
Unlike many self-supervised frameworks that train on ordinary natural images, our method was pre-trained entirely on unlabeled fundus images. This ensures that the learned features are directly relevant to retinal anatomy rather than generic visual objects. As a result, the proposed framework shows stronger feature representation, better interpretability, and higher accuracy than existing self-supervised methods. It demonstrates that incorporating biological structure into self-supervised learning can significantly improve performance in medical image analysis.
Conclusions
In this study, we proposed a self-supervised learning method for automated detection of RP and STGD from retinal images. By leveraging large-scale unlabeled data and introducing a novel augmentation method using paired left and right eye images, our method enables effective feature learning even with limited labeled samples. The two-phase training strategy, self-supervised pre-training followed by supervised fine-tuning, demonstrated strong generalization and robustness in medical image classification.
Experimental results showed that our method significantly outperformed state-of-the-art supervised and other self-supervised methods, achieving 98.15% accuracy and 99.68% AUC with the EfficientNet-B1 architecture. These findings highlight the value of self-supervised learning for developing reliable and scalable AI-based diagnostic tools, especially in domains where labeled data is scarce.
Acknowledgements
We acknowledge the use of AI tools, including ChatGPT and Grammarly, for improving the grammar and clarity of this manuscript. This article has been taken from the disease registry, titled “The Iranian National Registry for Inherited Retinal Dystrophy (IRDReg®)” and the code number of IR.SBMU.ORC.REC.1396.15, supported by the Deputy of Research and Technology at Shahid Beheshti University of Medical Sciences (http://dregistry.sbmu.ac.ir).
Author contributions
A.K. wrote the manuscript, designed and conducted the experiments, and developed and implemented the proposed method. H.A. contributed to results interpretation and manuscript writing. M.S.T. wrote the manuscript and designed and conducted the experiments. N.D. conducted patient examinations. A.F.A. was responsible for data collection and labeling. S.F. conducted patient examinations. H.G. managed and supervised the project. H.B.J. contributed to data collection. M.Y. was responsible for data management and extraction. M.S.P. managed and supervised the project. H.S. contributed to data collection, results interpretation, and manuscript writing. All authors reviewed the manuscript.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Data availability
The datasets used in the current study are available at: http://www.github.com/riovs/ird_dataset.
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
In accordance with ethical guidelines, studies involving human participants were reviewed and approved by the Iranian National Registry for Inherited Retinal Dystrophy (IRDReg®) under code number IR.SBMU.ORC.REC.1396.15. This study is supported by the Deputy of Research and Technology at Shahid Beheshti University of Medical Sciences (http://dregistry.sbmu.ac.ir). Additionally, explicit consent was obtained from each individual for the publication of any images in this manuscript.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Esengoenuel, M., Marta, A., Beirao, J., Pires, I. M. & Cunha, A. A systematic review of artificial intelligence applications used for inherited retinal disease management. Medicina58, 504 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Fujinami-Yokokawa, Y. et al. Prediction of causative genes in inherited retinal disorders from spectral-domain optical coherence tomography utilizing deep learning techniques. J. Ophthalmol.2019, 1691064 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Shah, M., Roomans Ledo, A. & Rittscher, J. Automated classification of normal and stargardt disease optical coherence tomography images using deep learning. Acta Ophthalmol.98, e715–e721 (2020). [DOI] [PubMed]
- 4.Davidson, B. et al. Automatic cone photoreceptor localisation in healthy and stargardt afflicted retinas using deep learning. Sci. Rep.8, 7911 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Miere, A. et al. Deep learning-based classification of retinal atrophy using fundus autofluorescence imaging. Comput. Biol. Med.130, 104198 (2021). [DOI] [PubMed] [Google Scholar]
- 6.Charng, J. et al. Deep learning segmentation of hyperautofluorescent fleck lesions in stargardt disease. Sci. Rep.10, 16491 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Iadanza, E. et al. Automatic detection of genetic diseases in pediatric age using pupillometry. IEEE Access8, 34949–34961 (2020). [Google Scholar]
- 8.Miere, A. et al. Deep learning-based classification of inherited retinal diseases using fundus autofluorescence. J. Clin. Med.9, 3303 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Jafarbeglou, F. et al. A deep learning model for diagnosis of inherited retinal diseases. Sci. Rep.15, 22523 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Güven, A., Karaman, B., Öner, A. & Sinim Kahraman, N. Detection of retinitis pigmentosa stages with gan and transfer learning in maps of mferg p1 wave amplitudes. SIViP19, 528 (2025). [Google Scholar]
- 11.Karaman, B., Güven, A., Öner, A. & Kahraman, N. S. Classification of retinitis pigmentosa stages based on machine learning by fusion of image features of vf and mferg maps. Electronics14, 1867 (2025). [Google Scholar]
- 12.Ferreira, H. et al. Retinitis pigmentosa classification with deep learning and integrated gradients analysis. Appl. Sci.15, 2181 (2025). [Google Scholar]
- 13.Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst.25 (2012).
- 14.Simonyan, K. Very deep convolutional networks for large-scale image recognition. arXiv preprintarXiv:1409.1556 (2014).
- 15.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
- 16.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826 (2016).
- 17.Howard, A. G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprintarXiv:1704.04861 (2017).
- 18.Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4510–4520 (2018).
- 19.Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114 (PMLR, 2019).
- 20.Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Commun. ACM60, 84–90 (2017). [Google Scholar]
- 21.Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M. & Van Gool, L. Scan: Learning to classify images without labels. In European conference on computer vision, 268–285 (Springer, 2020).
- 22.Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607 (PMLR, 2020).
- 23.Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. E. Big self-supervised models are strong semi-supervised learners. Adv. Neural. Inf. Process. Syst.33, 22243–22255 (2020). [Google Scholar]
- 24.Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J.27, 379–423 (1948). [Google Scholar]
- 25.Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprintarXiv:1609.04747 (2016).
- 26.Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag.45, 427–437 (2009). [Google Scholar]
- 27.Fletcher, R. H., Fletcher, S. W. & Fletcher, G. S. Clinical epidemiology: the essentials (Lippincott Williams & Wilkins, 2012).
- 28.Tan, M. et al. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2820–2828 (2019).
- 29.Ma, N., Zhang, X., Zheng, H.-T. & Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), 116–131 (2018).
- 30.Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017).
- 31.Draelos, R. L. & Carin, L. Use hirescam instead of grad-cam for faithful explanations of convolutional neural networks. arXiv preprintarXiv:2011.08891 (2020).
- 32.Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with momentum contrastive learning. arXiv preprintarXiv:2003.04297 (2020).
- 33.Grill, J.-B. et al. Bootstrap your own latent-a new approach to self-supervised learning. Adv. Neural. Inf. Process. Syst.33, 21271–21284 (2020). [Google Scholar]
- 34.Chen, X. & He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 15750–15758 (2021).
- 35.Ubukata, S. et al. Fundus image analysis of retinitis pigmentosa using artificial intelligence. PREPRINT (Version 1) available at Research Square10.21203/rs.3.rs-4851616/v1 (2024).
- 36.Sun, G. et al. Deep learning for the detection of multiple fundus diseases using ultra-widefield images. Ophthalmol. Therapy12, 895–907 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Fujinami-Yokokawa, Y. et al. Prediction of causative genes in inherited retinal disorder from fundus photography and autofluorescence imaging using deep learning techniques. Br. J. Ophthalmol.105, 1272–1279 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Chen, T.-C. et al. Artificial intelligence-assisted early detection of retinitis pigmentosathe most common inherited retinal degeneration. J. Digit. Imaging34, 948–958 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Guo, C., Yu, M. & Li, J. Prediction of different eye diseases based on fundus photography via deep transfer learning. J. Clin. Med.10, 5481 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets used in the current study are available at: http://www.github.com/riovs/ird_dataset.