Journal of Imaging Informatics in Medicine. 2025 Apr 29;39(1):714–731. doi: 10.1007/s10278-025-01513-7

Low-Rank Fine-Tuning Meets Cross-modal Analysis: A Robust Framework for Age-Related Macular Degeneration Categorization

Baochen Zhen 1, Yongbin Qi 1, Zizhen Tang 2, Chaoyong Liu 1, Shilin Zhao 1, Yansuo Yu 1, Qiang Liu 1
PMCID: PMC12920962  PMID: 40301288

Abstract

Age-related macular degeneration (AMD) is a prevalent retinal degenerative disease among the elderly and a major cause of irreversible vision loss worldwide. Although color fundus photography (CFP) and optical coherence tomography (OCT) are widely used for AMD diagnosis, information from a single modality is inadequate to fully capture the complex pathological features of AMD. To address this limitation, this study proposes an innovative multi-modal deep learning framework that fine-tunes pre-trained single-modal retinal models for efficient application to multi-modal AMD categorization. Specifically, two independent vision transformer models extract features from CFP and OCT images, followed by deep canonical correlation analysis (DCCA), which performs nonlinear mapping and fusion of features from both modalities to maximize cross-modal feature correlation. Moreover, to reduce the computational complexity of multi-modal integration, we introduce the low-rank adaptation (LoRA) technique, which applies low-rank decomposition to parameter matrices and achieves performance superior to full fine-tuning with only about 0.49% of the trainable parameters. Experimental results on the public MMC-AMD dataset validate the framework’s effectiveness. The proposed model achieves an overall F1-score of 0.948, AUC-ROC of 0.991, and accuracy of 0.949, significantly outperforming existing single-modal and multi-modal baseline models and particularly excelling at recognizing complex pathological categories.

Keywords: Multi-modal AMD categorization, Vision transformer, Fine-tuning, DCCA, LoRA

Introduction

Age-related macular degeneration (AMD) is a highly prevalent retinal degenerative disease among the elderly and is one of the leading causes of irreversible vision loss worldwide [1]. Based on pathological characteristics and disease progression, AMD can be further classified into dryAMD, wetAMD [2], and polypoidal choroidal vasculopathy (PCV) [3, 4]. Given the differences in treatment approaches among these subtypes, accurately distinguishing normal retina from different AMD subtypes holds significant clinical importance for developing personalized treatment strategies and improving patient prognosis. In AMD diagnosis, color fundus photography (CFP) and optical coherence tomography (OCT) are two commonly used non-invasive imaging techniques. CFP provides high-resolution color images of the retinal surface, aiding in the detection of surface abnormalities, but has limited capability in recognizing changes in the deeper retinal structures. In contrast, OCT offers detailed insights into the layered structure of the retina, revealing features of deeper lesions, though it is relatively less effective in assessing surface color and texture. Normal CFP and OCT images, along with images exhibiting specific AMD-related pathologies, are illustrated in Fig. 1. A recent study [5] reported that when provided with multi-modal information (CFP/OCT images and clinical records), experts made fewer diagnostic errors in referral recommendations.

Fig. 1.

Fig. 1

Color fundus images of specific disease eyes (first row) and OCT B-scan images (second row). Color fundus images show a planar view of the fundus, while OCT B-scans provide cross-sectional information about the retina and choroid. Each set of images is randomly selected from our dataset based on their categories

Due to a shortage of experienced ophthalmologists, increasing research efforts have focused on automated screening methods for AMD based on CFP images [6–9], OCT images [10–12], or a combination of both [13, 14]. However, the accuracy and robustness of these methods depend on the availability of large, high-quality annotated datasets, which are scarce and difficult to obtain in ophthalmology due to high annotation costs and privacy concerns [15]. This poses a significant challenge for the development and optimization of automated screening systems. Consequently, using pre-trained weights from ImageNet [16] as a starting point for transfer learning has gradually become a common strategy. Recently, a model named RETFound [17] has garnered significant attention. As a general retinal foundation model based on the vision transformer [18], RETFound learns rich visual feature representations through pre-training on a large dataset of unlabeled retinal images, allowing efficient fine-tuning with limited labeled data in downstream tasks. Notably, this model excels at recognizing complex patterns and features related to eye health within a single modality.

However, as a single-modal model, RETFound’s architecture can only process single-modal data, which somewhat limits its ability to comprehensively capture the features of retinal lesions. Relying solely on single-modal data when diagnosing complex lesions such as AMD may lead to the loss of critical pathological information. In contrast, multi-modal data fusion can combine the strengths of different modalities, providing a more comprehensive and detailed description of lesions. Therefore, integrating multi-modal information into the RETFound model during downstream task fine-tuning to enhance its capability to recognize complex AMD lesions and improve diagnostic accuracy is an important avenue worth exploring.

In practical applications, training and fine-tuning multi-modal models face dual challenges of computational complexity and modality fusion. As the number of modalities increases, the number of model parameters will grow exponentially, leading to significant increases in computational overhead, especially when integrating CFP and OCT images for AMD diagnosis. Additionally, the heterogeneity between features from different modalities further complicates effective information fusion. How to fully leverage the complementary advantages of each modality while ensuring computational efficiency remains a key issue. If these challenges are not effectively addressed, the potential of RETFound in multi-modal AMD diagnosis will not be fully realized.

Inspired by the above discussions, this study aims to achieve multi-modal AMD recognition through fine-tuning a large pre-trained single-modal retinal foundation model. This process focuses not only on how to effectively extend the model to integrate multi-modal data during fine-tuning but also emphasizes reducing computational costs and optimizing information fusion between modalities. Considering the significant increase in computational complexity during the integration of multi-modal data, this study employs the low-rank adaptation (LoRA) method [19], which introduces an adaptation mechanism using low-rank matrices, decomposing the full parameter matrix into a product of two low-rank matrices, allowing for optimization adjustments to only a subset of model parameters during fine-tuning. Unlike previous works that primarily applied LoRA in single-modal contexts, this study systematically explores its effectiveness in multi-modal medical imaging for the first time, demonstrating substantial parameter reduction (over 99%) along with significant performance enhancement. To bridge the heterogeneity of features from different modalities, we introduce deep canonical correlation analysis (DCCA) [20]. Specifically, we perform fine-grained nonlinear feature transformation between CFP and OCT modality features through two independent neural networks, coupled with classic canonical correlation analysis (CCA) [21] for regularization, to maximize the linear correlation between the two modalities’ features. Notably, integrating DCCA with LoRA represents a novel methodological combination that simultaneously resolves critical issues of parameter redundancy and cross-modal heterogeneity in multi-modal AMD categorization, which has not been fully explored in previous research. During training, features from both modalities are mapped to a coordinated high-dimensional hyperspace, enabling the model to learn highly correlated multi-modal features. 
Notably, this study fully retains the original network structure of the large pre-trained single-modal retinal foundation model, ensuring that its single-modality feature processing capability is not compromised.

In summary, the contributions of this study are primarily reflected in the following four aspects:

  • We effectively extend a large pre-trained single-modal retinal foundation model into a model for multi-modal AMD recognition through fine-tuning. The proposed method retains the powerful processing capabilities of the large single-modal model while achieving multi-modal recognition.

  • To address the challenges posed by the increased computational complexity during multi-modal data integration, we decompose the model parameters into low-rank matrices, requiring only a limited number of parameters to be optimized during the fine-tuning phase, thereby significantly reducing the demand for computational resources.

  • To tackle the heterogeneity between different modality features, we employ independent neural networks to perform nonlinear transformations of CFP and OCT modality features, enabling the learning of highly correlated multi-modal features.

  • We validated the proposed method on the publicly available MMC-AMD dataset, and experimental results demonstrate that our method achieved optimal performance in the AMD classification task compared to the best single-modal and multi-modal baseline models, further confirming its effectiveness.

Related Works

Traditional AMD classification methods [22] rely on features manually extracted by experts, which are typically based on morphological and statistical properties of the images. In recent years, deep learning has made significant progress in the field of medical image analysis, particularly in AMD classification. Relevant research has focused on three main areas: deep learning-based AMD classification methods, pre-training techniques in ophthalmic disease classification, and efficient fine-tuning strategies. The intersection of these research directions provides the theoretical foundation and technical support for the proposed multi-modal AMD recognition method in this study.

Deep Learning-Based AMD Classification

The application of deep learning in AMD classification initially focused on single-modal data, specifically CFP images [6–9, 23] or OCT images [10]. In [6] and [23], Burlina et al. processed AMD classification using classical image classification methods (i.e., feature extraction followed by classifier training). Specifically, the authors extracted visual features from CFP images using CNNs, followed by a linear SVM as the classifier. In subsequent work [7], Burlina et al. found that end-to-end trained CNNs outperformed classical methods. Grassmann et al. utilized ensemble learning combined with multiple CNNs [8]. Similarly, Lee et al. [10] employed the VGG16 model to classify OCT images as either normal or AMD, showing improved performance in detecting early AMD compared to CFP-based models. Additionally, Vannadil et al. [9] leveraged the capability of ViT [18] to capture global context and long-range dependencies in fundus images, making the model less reliant on the extraction of local features related to AMD.

Despite the significant achievements of single-modal deep learning models in AMD classification, their primary limitation lies in their inability to fully utilize the complementary information provided by different imaging techniques. To overcome this limitation, research has gradually shifted towards multi-modal AMD classification. Yoo et al. [13] made an initial attempt at AMD classification by utilizing both CFP and OCT images. However, similar to the single-modal context [6, 23], the authors followed classical image classification algorithms, using the VGG19 model to extract fundus image features from both modalities and then concatenating these features as input to a random forest classifier. Subsequently, researchers further optimized the architecture of multi-modal models. Wang et al. proposed a dual-branch CNN model [14] that processes CFP and OCT images separately, performing fusion at multiple feature levels, thereby enhancing the model’s overall discriminative ability in identifying AMD lesions. Compared to single-modal deep learning models, multi-modal methods better leverage information from various imaging techniques, improving classification accuracy and model robustness for AMD.

Pre-training Techniques in Ophthalmic Disease Classification

Pre-training techniques have been widely applied in ophthalmic disease classification, particularly in cases where dataset labeling is scarce, effectively enhancing model generalization capabilities [24]. Early work primarily involved pre-training CNN models on large-scale datasets such as ImageNet [16] and transferring them to specific ophthalmic disease tasks [23, 25, 26]. Kermany et al. [25] utilized a pre-trained ResNet model to process OCT images, achieving significant results in various ophthalmic disease classification tasks. Shen et al. [26] also employed a similar approach, using pre-trained deep learning models for early detection of glaucoma. With the development of deep learning models, ViT models based on self-attention mechanisms [18] have demonstrated strong capabilities in capturing global features in image classification tasks, excelling in fundus image analysis through pre-training [27]. In recent years, self-supervised pre-training strategies based on retinal images have also emerged. For instance, Li et al. [28] proposed a model that, through unsupervised training on large-scale retinal images, exhibited exceptional performance in subsequent AMD detection. Similarly, Zhou et al. [17] developed a general retinal base model through pre-training on large-scale unlabelled images, demonstrating robust performance across various ophthalmic disease diagnosis tasks. However, most current pre-trained models primarily focus on single-modal data and have not fully exploited the complementary information from multi-modal imaging. Therefore, this study builds on existing pre-training strategies to further explore the integration and application of multi-modal data to enhance AMD classification performance.

The Evolution of Efficient Fine-Tuning

With the advancement of deep learning technology, fine-tuning strategies have evolved to improve the efficiency of models in specific tasks. Early fine-tuning methods [29–31] primarily focused on freezing certain model layers and only updating parameters in the last few layers to reduce computational load. Specifically, Howard et al. [29] achieved effective fine-tuning in natural language processing tasks by freezing the initial layers and only updating the parameters of the output layer. Similarly, Yosinski et al. [30] demonstrated the effectiveness of this strategy in natural image processing tasks by freezing the initial layers of CNNs. Vrbančič et al. proposed an adaptive fine-tuning mechanism for transfer learning [31] that automatically determines which layers of CNNs to fine-tune for a given image set. However, layer freezing methods fail to fully leverage the knowledge of pre-trained models, limiting the potential performance of the models. To address this, Han et al. [32] proposed a parameter pruning strategy that reduces the storage and computational resources required by neural networks by pruning redundant connections while maintaining accuracy. This approach effectively reduces the computational burden, but excessive pruning may degrade model performance in complex tasks. In subsequent work [33], Houlsby et al. designed an adapter network that provides efficient parameter optimization for NLP, achieving performance on par with fully fine-tuned BERT models [34] using only 3% of task-specific parameters. In contrast, LoRA [19] employs a more refined matrix decomposition strategy, decomposing the weight matrices of pre-trained models into low-rank matrices and only updating the newly inserted components, thereby preserving the original structure of the model and further enhancing computational efficiency and task adaptability.
These advancements provide a crucial foundation for this research, facilitating the effective integration and application of multi-modal data in AMD classification tasks.

Methodology

Overall Framework

This study proposes a novel deep learning model that leverages the joint diagnostic potential of CFP and OCT images for the automatic detection of multi-modal AMD. Figure 2 illustrates the overall framework of our proposed method. We effectively extend the pre-trained retinal foundation model through fine-tuning to create a model for multi-modal AMD recognition. Specifically, CFP and OCT images are processed by two independent ViT-large models for feature extraction, generating high-dimensional feature representations for each modality. These feature representations are input into the DCCA module after undergoing nonlinear transformations. DCCA performs fine-grained nonlinear mapping on the features of different modalities through two independent neural networks, aiming to maximize the nonlinear correlation between the two modalities, thus achieving efficient cross-modal feature fusion. Subsequently, we employ a concatenation fusion strategy to effectively merge the transformed features from both modalities, forming a comprehensive feature vector. Finally, we utilize the fused features to train a classifier for AMD classification. To address the challenge of significantly increased computational complexity during the multi-modal integration process, this study introduces LoRA technology for model fine-tuning. LoRA optimizes a high-dimensional parameter matrix by decomposing it into two low-rank matrices, allowing only a small number of parameters to be adjusted during fine-tuning.

Fig. 2.

Fig. 2

The overall architecture for multi-modal AMD categorization. Stage one constructs the RETFound model using CFP and OCT images from three datasets by means of self-supervised learning. In stage two, the model is fine-tuned for multi-modal AMD diagnosis using supervised learning, incorporating the LoRA technique. Parameters marked with a “lock” are frozen and remain unchanged, while those without a “lock” are updated during training. Next, DCCA is used to integrate features from both modalities, maximizing cross-modal correlation, followed by feature fusion through a concatenation strategy. Ultimately, the fused feature representation is used to train a classifier for AMD detection

In our model architecture design, we adopt a staged feature embedding strategy. Initially, the ViT encoder blocks are initialized with the pre-trained weights from the RETFound model to ensure efficient processing of single-modal data. During the multi-modal fine-tuning phase, some encoder blocks are frozen, and only certain layers undergo LoRA fine-tuning, enabling the model to adapt to the requirements of the multi-modal AMD task.

Feature Vector Extraction with a Foundation Model

The feature extraction component of our proposed model is based on the ViT-large encoder from the RETFound model [17], which is trained on a large corpus of unlabeled retinal images through self-supervised learning and is transferable to labeled eye and systemic disease detection tasks. This model employs a masked autoencoder architecture, consisting of an encoder and a decoder. The encoder utilizes the ViT-large structure, which includes 24 transformer blocks and a 1024-dimensional embedding vector. The encoder processes unmasked image patches of size 16 × 16 pixels, projecting these patches into 1024-dimensional feature vectors. The transformer blocks generate high-level features through multi-head self-attention mechanisms and multi-layer perceptrons. The decoder is structured as ViT-small, which reconstructs masked image patches using 8 transformer blocks and a 512-dimensional embedding vector.

In our framework, the decoder part of the base model is discarded, and we retain only the ViT-large encoder for feature extraction. CFP and OCT images are processed by two independent ViT-large encoders, ensuring independent feature extraction for each modality. Specifically, let the input CFP image be $I_{cfp} \in \mathbb{R}^{H \times W}$ and the input OCT image be $I_{oct} \in \mathbb{R}^{H \times W}$. Each image is first processed through a patch embedding layer, which segments it into fixed-size patches, followed by positional encoding to generate the corresponding input feature vectors, denoted as $x_{cfp} \in \mathbb{R}^{1024}$ and $x_{oct} \in \mathbb{R}^{1024}$. Each ViT-large encoder processes its respective input through a series of transformer blocks, each containing multi-head self-attention mechanisms and feedforward neural networks, ultimately yielding two high-dimensional feature vectors $f_{cfp} \in \mathbb{R}^{1024}$ and $f_{oct} \in \mathbb{R}^{1024}$. These feature vectors capture significant information from their respective modalities. Notably, while the CFP and OCT images are processed in independent encoders, they follow the same feature extraction process to ensure consistency of cross-modal features for subsequent feature fusion.
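To make the extraction pipeline concrete, the following minimal NumPy sketch imitates the patch-embedding step that precedes the transformer blocks. The function name `patch_embed`, the random projection weights, and the mean-pooling used to obtain a single 1024-dimensional vector are illustrative stand-ins for RETFound’s pre-trained layers, not the actual implementation.

```python
import numpy as np

def patch_embed(image, patch=16, embed_dim=1024, rng=None):
    """Toy stand-in for the ViT-large patch-embedding + encoding pipeline.

    Splits an (H, W, 3) image into 16x16 patches, linearly projects each
    patch to embed_dim, and mean-pools into one feature vector. RETFound
    uses pre-trained projection weights followed by 24 transformer blocks;
    the random weights below only illustrate the tensor shapes involved.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (num_patches, p * p * C)
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    W_proj = rng.standard_normal((patch * patch * C, embed_dim)) * 0.02
    tokens = patches @ W_proj            # (196, 1024) for a 224x224 input
    return tokens.mean(axis=0)           # pooled 1024-d feature vector

# Two independent encoders, one per modality, sharing the same pipeline shape.
f_cfp = patch_embed(np.random.default_rng(1).random((224, 224, 3)),
                    rng=np.random.default_rng(10))
f_oct = patch_embed(np.random.default_rng(2).random((224, 224, 3)),
                    rng=np.random.default_rng(20))
```

Both modalities pass through structurally identical but separately parameterized encoders, matching the independent-encoder design described above.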

Low-Rank Adaptation for Multi-modal AMD Classification

In our proposed multi-modal AMD classification framework, to address the high computational demands of fine-tuning the cross-modal feature fusion of CFP and OCT images, we employ low-rank adaptation (LoRA) technology [19] for precise adjustments of model parameters. As shown in Fig. 3, the core idea of LoRA is to optimize the self-attention layers of the transformer through low-rank decomposition to achieve effective control of computational complexity.

Fig. 3.

Fig. 3

Schematic diagram of the LoRA method for multi-modal AMD diagnosis. We inject the pre-trained query matrix $W_Q$ and value projection matrix $W_V$ of each self-attention layer from different modalities with the low-rank decomposition matrices (represented as $A_{cfp}$, $A_{oct}$, $B_{cfp}$, and $B_{oct}$). The different colored modules correspond to different AMD modalities

Specifically, after processing the input CFP and OCT images through patch embedding and positional encoding, they are converted to input feature vectors $x_{cfp} \in \mathbb{R}^{1024}$ and $x_{oct} \in \mathbb{R}^{1024}$. These vectors are then sent to their respective ViT-large encoders, each comprising multiple transformer blocks that contain the query matrix $W_Q$, key matrix $W_K$, and value projection matrix $W_V$ of the multi-head self-attention mechanism. In the self-attention layer, LoRA optimizes weight updates by performing low-rank decomposition on $W_Q$ and $W_V$, which can be expressed as follows:

$$\Delta W = BA \tag{1}$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ are the matrices produced by the low-rank decomposition, with rank $r$ much smaller than dimension $d$. This design aims to reduce the computational load during the model fine-tuning phase.

After LoRA processing, the feature vectors for the CFP and OCT modalities, denoted as $h_{cfp} \in \mathbb{R}^{1 \times 1024}$ and $h_{oct} \in \mathbb{R}^{1 \times 1024}$, are generated by the following formula and will be used for subsequent multi-modal fusion and classification tasks:

$$h = W_0 x + \Delta W x = W_0 x + BAx \tag{2}$$

where $W_0$ represents the original pre-trained weights, and $\Delta W$ is the weight update introduced by LoRA.

This approach ensures independent processing and updating of features from CFP and OCT images, providing efficient and accurate feature representation for multi-modal data before fusion. Furthermore, we set the rank r of the low-rank matrices to 4, empirically balancing the complexity of the model and the computational efficiency. During model fine-tuning, we only adjust these low-rank matrices while keeping other parameters unchanged, effectively reducing training time and computational resources while maintaining model performance.
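The update rule of Eqs. 1–2 can be sketched in a few lines of NumPy. The names below are illustrative (a real implementation would wrap the frozen attention projections of each ViT encoder, typically in PyTorch), but the arithmetic is the same: $W_0$ stays frozen while only the low-rank factors $B$ and $A$ with rank $r = 4$ are trained.

```python
import numpy as np

d, r = 1024, 4                      # embedding dim and the paper's rank r = 4
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d, d))    # frozen pre-trained weight (e.g. W_Q or W_V)
B = np.zeros((d, r))                # LoRA matrix B, initialized to zero
A = rng.standard_normal((r, d)) * 0.01  # LoRA matrix A, small random init

def lora_forward(x):
    # h = W0 x + dW x with dW = B A (Eqs. 1-2); only A and B receive gradients.
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(d)
h = lora_forward(x)

# Trainable fraction for this layer: 2 * d * r out of d * d parameters.
ratio = (A.size + B.size) / W0.size
```

With $B$ initialized to zero, the adapted layer starts out identical to the pre-trained one, and the per-layer trainable fraction is $2dr / d^2 = 2r/d \approx 0.8\%$, consistent with the drastic parameter reduction reported for the full model.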

Deep Canonical Correlation Analysis for Multi-modal AMD Classification

This paper introduces Deep Canonical Correlation Analysis (DCCA) [20] into the multi-modal AMD classification task, aiming to address the challenges of heterogeneity and nonlinear feature alignment and correlation modeling between CFP and OCT modalities. The core idea of DCCA is to perform nonlinear mapping of the input features of each modality through deep neural networks, transforming them into a shared latent space. By maximizing the correlation of features from different modalities in this space, DCCA enhances the synergistic effect between modalities, thereby improving the effectiveness of multi-modal feature fusion. Figure 4 shows the DCCA structure used in this study.

Fig. 4.

Fig. 4

The structure of DCCA. Different modalities undergo independent transformations through their respective neural networks. The outputs ($O_{cfp}$ and $O_{oct}$) are regularized using traditional CCA constraints, with parameters updated to maximize the CCA metric between the two modalities

Specifically, DCCA consists of two independent neural networks that process the features of the CFP and OCT modalities, respectively. Let $h_{cfp} \in \mathbb{R}^{N \times d_1}$ and $h_{oct} \in \mathbb{R}^{N \times d_2}$ represent the instance feature matrices from CFP and OCT, respectively. Here, $N$ is the number of instances, and $d_1$ and $d_2$ are the feature dimensions of the two modalities. The input features of each modality undergo multi-layer nonlinear transformations through their respective deep neural networks, generating two new feature representations $O_{cfp} \in \mathbb{R}^{N \times d}$ and $O_{oct} \in \mathbb{R}^{N \times d}$ as follows:

$$O_{cfp} = f_{cfp}(h_{cfp}; W_1) \tag{3}$$
$$O_{oct} = f_{oct}(h_{oct}; W_2) \tag{4}$$

where $W_1$ and $W_2$ are the parameters of the neural networks’ nonlinear transformations, and $d$ denotes the output dimension of DCCA. To measure the correlation between the two modalities’ features, we employ canonical correlation analysis (CCA) [21]. First, we center the output features of both modalities and then construct the covariance matrices:

$$\Sigma_{11} = \frac{1}{N} O_{cfp}^{\top} O_{cfp} \tag{5}$$
$$\Sigma_{22} = \frac{1}{N} O_{oct}^{\top} O_{oct} \tag{6}$$
$$\Sigma_{12} = \frac{1}{N} O_{cfp}^{\top} O_{oct} \tag{7}$$

where $\Sigma_{11}$ and $\Sigma_{22}$ are the self-covariance matrices of the two modalities’ features, and $\Sigma_{12}$ is the cross-covariance matrix. The goal of DCCA is to jointly optimize the neural network parameters $W_1$ and $W_2$ to maximize the correlation between the cross-modal features $O_{cfp}$ and $O_{oct}$ in a shared space, which can be expressed by the following formula:

$$\rho(W_1, W_2) = \left\lVert \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} \right\rVert_{\mathrm{tr}} \tag{8}$$

After training the two neural networks, the projected directions of the transformed features $O_{cfp}$ and $O_{oct}$ in the shared space are adjusted to maximize their correlation. Based on the aforementioned construction process, DCCA provides the following advantages for multi-modal AMD recognition:

  • By separately transforming the different modalities, we can explicitly extract the variation features of both modalities ($O_{cfp}$ and $O_{oct}$), facilitating the examination of the properties and relationships between the two modalities.

  • With the specified CCA constraints, we can regularize the nonlinear mappings ($f_{cfp}(\cdot)$ and $f_{oct}(\cdot)$) while ensuring that the model retains information relevant to AMD.
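Under these definitions, the correlation objective of Eqs. 5–8 can be computed as in the NumPy sketch below. The small ridge term `eps` added to the self-covariance matrices is an assumed numerical-stability detail (standard in DCCA implementations but not stated in the text); the function returns the trace norm whose negative serves as the DCCA loss.

```python
import numpy as np

def dcca_corr(O1, O2, eps=1e-6):
    """Total canonical correlation between two (N, d) feature matrices (Eq. 8).

    Returns the trace norm (sum of singular values) of
    T = Sigma11^{-1/2} Sigma12 Sigma22^{-1/2}.
    """
    N = O1.shape[0]
    O1 = O1 - O1.mean(axis=0)          # center each modality (Eqs. 5-7)
    O2 = O2 - O2.mean(axis=0)
    S11 = O1.T @ O1 / N + eps * np.eye(O1.shape[1])
    S22 = O2.T @ O2 / N + eps * np.eye(O2.shape[1])
    S12 = O1.T @ O2 / N

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is SPD).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```

For identical inputs the $d$ canonical correlations all approach 1, so the value approaches $d$; for unrelated inputs it stays small, which is what the maximization exploits.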

Overall Loss

We use cross-entropy loss as the classification loss $L_{CE}$ for optimization:

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log\left(\hat{y}_{i,j}\right) \tag{9}$$

where $N$ is the number of samples, $C$ is the number of classes, $y_{i,j}$ is the ground-truth label of sample $i$ for class $j$, and $\hat{y}_{i,j}$ is the predicted probability that sample $i$ belongs to class $j$.

In addition to the classification loss, we introduce the DCCA loss function, aimed at enhancing the correlation between multi-modal features. To accommodate the gradient descent optimization process, the objective function of DCCA is transformed into negative correlation, allowing it to be optimized through minimization. Its loss function is defined as follows:

$$L_{DCCA}(W_1, W_2) = -\operatorname{corr}(O_{cfp}, O_{oct}) \tag{10}$$

The overall loss of the model is a weighted combination of the classification loss and DCCA loss:

$$L_{total} = L_{CE} + \lambda L_{DCCA} \tag{11}$$

where λ is the loss weight used to balance the impact of the two loss components on model training. The overall loss function takes into account the optimization of feature extraction correlation and the accuracy of the classification task, achieving synergistic optimization of both. Through this dual optimization strategy, the model can learn highly correlated multi-modal features while demonstrating higher accuracy in classification tasks, thus achieving superior overall performance in multi-modal AMD classification tasks.
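A minimal sketch of Eqs. 9–11 follows, assuming `y_prob` already contains softmax probabilities and using the paper’s loss weight $\lambda = 0.5$ as the default; the correlation value would come from the DCCA branch, and the clamp inside the logarithm is an assumed numerical safeguard.

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    # Eq. 9: mean negative log-likelihood over N samples and C classes.
    # y_true is one-hot (N, C); y_prob holds predicted probabilities (N, C).
    return float(-np.mean(np.sum(y_true * np.log(y_prob + 1e-12), axis=1)))

def total_loss(y_true, y_prob, corr, lam=0.5):
    # Eq. 11 with Eq. 10 substituted: L_total = L_CE + lam * (-corr).
    return cross_entropy(y_true, y_prob) + lam * (-corr)
```

Minimizing `total_loss` therefore lowers classification error while simultaneously pushing the cross-modal correlation upward, which is the dual optimization described above.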

Experiments

Experimental Setup

Dataset

To evaluate the effectiveness of the model we developed for multi-modal AMD classification, we used the publicly available MMC-AMD medical imaging dataset. This dataset consists of 1094 CFP images and 1289 OCT images from 1093 distinct eyes, which were utilized to train and test our multi-modal automatic classification model for AMD. The MMC-AMD dataset is categorized into four classes: normal, dryAMD, PCV, and wetAMD.

Regarding the image pairing strategy, CFP and OCT images from the same eye, representing different slices, were paired. To prevent data leakage and ensure independent evaluation, dataset partitioning was conducted at the patient level, meaning that images from the same patient never appeared in both the training and test sets simultaneously. The dataset was stratified and divided into 70% training, 15% validation, and 15% testing subsets, maintaining a balanced class distribution across all sets. This resulted in a total of 1289 multi-modal AMD image pairs, with 221 pairs for normal, 175 pairs for dry AMD, 341 pairs for PCV, and 552 pairs for wet AMD.
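The patient-level, stratified partitioning described above might be sketched as follows. This simplified version assigns each patient to exactly one subset per class list and uses the 70/15/15 ratios from the text; the exact pairing and stratification logic of the original study may differ, and the function name is illustrative.

```python
import random
from collections import defaultdict

def patient_level_split(pairs, train=0.70, val=0.15, seed=0):
    """Split (patient_id, label) pairs so that no patient spans two subsets.

    Stratification is approximated by splitting each label's patient list
    separately, keeping the class distribution roughly balanced per subset.
    """
    by_class = defaultdict(set)
    for pid, label in pairs:
        by_class[label].add(pid)
    rng = random.Random(seed)
    split = {}
    for label, pids in by_class.items():
        pids = sorted(pids)
        rng.shuffle(pids)
        n_tr = int(len(pids) * train)
        n_va = int(len(pids) * val)
        for p in pids[:n_tr]:
            split[p] = "train"
        for p in pids[n_tr:n_tr + n_va]:
            split[p] = "val"
        for p in pids[n_tr + n_va:]:
            split[p] = "test"
    return split
```

Because assignment happens at the patient level before any image pairing, images from one patient can never leak across the training, validation, and test subsets.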

Implementation Details

First, the raw images are standardized, including pixel value normalization and resizing to 224×224 pixels. Contrast-limited adaptive histogram equalization (CLAHE) [35] is applied to enhance the quality of the CFP images, while a 3×3 kernel median filter is used for the OCT images to reduce noise. It is important to note that these preprocessing methods do not replace traditional data augmentation strategies. Prior to applying the above preprocessing techniques, basic data augmentation is performed on the training images, including random cropping, flipping, rotation, and random adjustments of contrast, saturation, and brightness. Furthermore, since CFP images are three-channel color images and OCT images are single-channel grayscale images, a channel replication method is employed to extend the single-channel OCT images to three channels, ensuring proper pairing and alignment when inputting both image types into the model.
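As a small illustration of the channel-replication step, the sketch below normalizes a single-channel OCT B-scan and repeats it across three channels. The CLAHE and median-filtering steps are only indicated in the comments, since they would rely on an image-processing library such as OpenCV (`createCLAHE`, `medianBlur`); the function name is illustrative.

```python
import numpy as np

def preprocess_oct(oct_gray):
    """Scale a single-channel OCT B-scan to [0, 1] and replicate it to
    3 channels so it matches the 3-channel CFP input of the ViT encoder.

    In the full pipeline, a 3x3 median filter would be applied to the OCT
    image first (and CLAHE to the CFP image), e.g. via OpenCV.
    """
    img = oct_gray.astype(np.float32)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # scale to [0, 1]
    return np.repeat(img[..., None], 3, axis=-1)              # (H, W) -> (H, W, 3)
```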

To further enhance model generalization and reduce the impact of paired-data correlation, stratified k-fold cross-validation (k=5) is employed, ensuring that each fold maintains a similar distribution of AMD subtypes while preventing data from the same eye from appearing across different folds. This approach allows the model to be trained and validated on diverse subsets, reducing the risk of overfitting to correlated samples.

Additionally, to mitigate the impact of class imbalance on classification results, we oversample the classes with fewer samples (e.g., dry AMD and PCV) via controlled augmentation-based replication, ensuring a balanced class distribution without artificially inflating model confidence on duplicated samples. This approach employs moderate augmentation techniques that preserve the intrinsic variability of the data while avoiding redundancy. It is also supported by previous findings in the literature: Li et al. [36] demonstrate that oversampling effectively compensates for reduced sample sizes and enhances classification performance, thereby stabilizing the learning process and improving the model’s robustness and generalization capability.

All experiments are performed under consistent conditions regarding data processing, augmentation operations, model architectures, and training epochs. Specifically, the Adam optimizer is employed with a learning rate of 1e-3, a momentum term of 0.9, a weight decay of 1e-4, and a batch size of 16, while the loss weight is set to 0.5. All models are trained on an NVIDIA RTX 4090 GPU, with the best model checkpoint selected based on performance on the development set.

Performance Metrics

In this study, a range of key evaluation metrics is employed to comprehensively assess the performance of the proposed multi-modal AMD classification model. First, accuracy measures overall classification correctness, while the area under the receiver operating characteristic curve (AUC-ROC) [37] quantifies the model's ability to distinguish between positive and negative samples. In addition, the F1-score is reported, balancing precision and recall to ensure that the model not only classifies accurately but also captures positive samples efficiently. To mitigate performance fluctuations caused by randomness, each model undergoes five independent training and evaluation runs, and the mean values of the evaluation metrics are reported. Furthermore, paired t-tests are performed to compute p-values and confidence intervals for the classification performance metrics across runs. This statistical analysis confirms that the reported improvements are statistically significant rather than attributable to random variation.
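The per-class F1 computation and the paired t statistic used for significance testing can be sketched as follows; this is a minimal illustration, not the study's exact evaluation code.

```python
import numpy as np

def f1_per_class(conf):
    """Per-class F1 from a confusion matrix where conf[i, j] counts
    samples of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)
    recall = tp / np.maximum(conf.sum(axis=1), 1)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-8)

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired runs of two models (same folds and seeds)."""
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```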

Performance Comparison

The evaluation of the performance of single-modal and multi-modal models in the AMD classification task is essential for optimizing disease recognition methods. This study systematically conducts a series of experiments to thoroughly compare the performance and limitations of different models. As shown in Table 1, CFP-RET and OCT-RET represent single-modal models based on CFP and OCT images, respectively, while the multi-modal models MM-RET, MM-RET-LoRA, MM-RET-DCCA, and ours incorporate various technical components to explore the advantages of multi-modal data fusion. Additionally, the confusion matrix shown in Fig. 5 further quantifies the classification performance of each model across various AMD subtypes, offering a more detailed analysis of model performance. The following sections will explore the specific performance of single-modal and multi-modal models in the AMD classification task and provide a detailed analysis of how each component contributes to the performance of the multi-modal model.

Table 1.

Comparison of different metrics (standard deviation) for single-modal and multi-modal models

Model F1 (normal) F1 (dryAMD) F1 (PCV) F1 (wetAMD) F1 (Overall) AUC-ROC Accuracy
CFP-RET 1.000 (0.000) 0.787 (0.018) 0.745 (0.054) 0.550 (0.048) 0.771 (0.026) 0.929 (0.022) 0.775 (0.023)
OCT-RET 1.000 (0.000) 0.714 (0.017) 0.758 (0.010) 0.587 (0.010) 0.765 (0.004) 0.937 (0.002) 0.760 (0.005)
MM-RET 0.992 (0.005) 0.958 (0.024) 0.687 (0.042) 0.641 (0.079) 0.820 (0.033) 0.948 (0.013) 0.827 (0.029)
MM-RET-LoRA 1.000 (0.000) 0.994 (0.003) 0.801 (0.036) 0.805 (0.034) 0.900 (0.018) 0.982 (0.004) 0.902 (0.017)
MM-RET-DCCA 0.999 (0.003) 0.989 (0.008) 0.857 (0.034) 0.857 (0.040) 0.926 (0.020) 0.989 (0.004) 0.928 (0.019)
Ours 0.999 (0.003) 0.996 (0.006) 0.897 (0.034) 0.899 (0.041) 0.948 (0.020) 0.991 (0.005) 0.949 (0.020)

All models are initialized from the pre-trained baseline model. CFP-RET and OCT-RET represent the single-modal models based on CFP and OCT images, respectively. MM-RET is the multi-modal model using a simple fusion strategy, MM-RET-LoRA incorporates the LoRA method to optimize parameter efficiency, MM-RET-DCCA integrates DCCA to enhance the correlation of cross-modal features, and "Ours" refers to the model proposed in this study. Each model is trained and evaluated on the MMC-AMD dataset five times, and the average score for each metric is reported. p-values are calculated between our proposed model and the other models. *p < 0.05, **p < 0.01, ***p < 0.001

Bold values indicate the best performance metrics in each column

Fig. 5.

Fig. 5

Confusion matrices. In reviewing the five training and evaluation runs for each model, we plot the normalized confusion matrix for each model at its median performance to ensure a fair comparison. Our proposed model effectively reduces the confusion between PCV and wetAMD

Comparison of Single-Modal and Multi-modal Performance

The performance of single-modal models in the AMD classification task varies markedly across categories. As shown in Table 1, the CFP-RET model performs perfectly on the normal category (F1 = 1.000) but has considerable difficulty recognizing complex lesions such as wetAMD (F1 = 0.550), suggesting that CFP images alone are insufficient for recognizing complex cases. By contrast, the OCT-RET model offers slightly improved recognition of wetAMD (F1 = 0.587), although it still struggles with dryAMD and PCV. The confusion matrix in Fig. 5 provides further detail: the CFP-RET model matches the OCT-RET model in identifying the normal category and excels at recognizing true PCV cases. Nevertheless, both single-modal models misclassify a considerable portion of complex AMD lesions (e.g., wetAMD and dryAMD), underscoring their limitations in capturing the full pathological spectrum.

Regarding the performance of multi-modal models, the MM-RET model shows enhanced robustness in multi-modal data by combining features from CFP and OCT. For instance, in the recognition of dryAMD and wetAMD, the F1-scores of MM-RET reach 0.958 and 0.641, respectively, with an overall accuracy of 0.827, significantly outperforming single-modal models. This highlights the effectiveness of the simple fusion strategy in improving multi-modal feature representation and cross-modal information complementarity. However, in comparison to single-modal models, MM-RET has nearly double the number of trainable parameters, which raises the data demands and makes the model more susceptible to overfitting. In particular, performance on the PCV category remains suboptimal (F1 = 0.687). Compared to the MM-RET model, MM-RET-LoRA effectively reduces the number of trainable parameters while maintaining the benefits of multi-modal information fusion, leading to improved performance in classifying various types of AMD. For example, wetAMD classification improves notably from an F1-score of 0.641 to 0.805, suggesting that by reducing the parameter size and optimizing feature extraction, LoRA can effectively mitigate overfitting risks while improving recognition accuracy for complex pathological categories.

In the further optimization process, MM-RET-DCCA applies DCCA to perform nonlinear mapping and fusion of multi-modal features. Compared to MM-RET-LoRA, DCCA further enhances performance across all categories by more effectively capturing the correlation between CFP and OCT data (e.g., the wetAMD F1-score rises to 0.857). As a result, the feature differences across pathological categories become more distinct.
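DCCA maximizes the total canonical correlation between the outputs of the two modality networks. The linear-CCA core of that objective can be sketched in numpy as follows; batch sizes, feature dimensions, and the regularization constant are illustrative, and in training this quantity would be maximized (its negative used as a loss) through the networks.

```python
import numpy as np

def cca_correlation(h1, h2, eps=1e-6):
    """Sum of canonical correlations between two feature batches of shape (n, d).

    This is the quantity DCCA maximizes over the CFP and OCT network outputs.
    """
    n = h1.shape[0]
    h1 = h1 - h1.mean(axis=0)
    h2 = h2 - h2.mean(axis=0)
    s11 = h1.T @ h1 / (n - 1) + eps * np.eye(h1.shape[1])  # regularized covariances
    s22 = h2.T @ h2 / (n - 1) + eps * np.eye(h2.shape[1])
    s12 = h1.T @ h2 / (n - 1)

    def inv_sqrt(m):
        w, v = np.linalg.eigh(m)
        return v @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ v.T

    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)  # whitened cross-covariance
    return np.linalg.svd(t, compute_uv=False).sum()
```

Identical batches yield a correlation near the feature dimension d (every canonical correlation close to 1), while independent batches score much lower.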

In this study, the proposed multi-modal model (Ours) integrates LoRA and DCCA technologies to maximize the potential of multi-modal data fusion, leading to a substantial performance enhancement in the AMD classification task. As shown in Table 1, the proposed model outperforms both single-modal and other multi-modal models in all AMD subtypes. Specifically, for the dryAMD, PCV, and wetAMD categories, its F1-scores reach 0.996, 0.897, and 0.899, respectively, indicating extremely high recognition accuracy. Statistical analysis results (Table 1 and Fig. 6) further support these findings. Specifically, the proposed method achieves an overall F1-score of 0.948 (95% CI, 0.930–0.966), an AUC-ROC of 0.991 (95% CI, 0.987–0.996), and an accuracy of 0.949 (95% CI, 0.932–0.966). Paired t-tests confirm that these improvements are statistically significant compared with all baseline models (all p<0.05). As shown in the confusion matrix in Fig. 5, the multi-modal models significantly reduce misclassification rates for complex pathological categories, with the MM-RET-LoRA and MM-RET-DCCA models excelling in enhancing the distinction between PCV and wetAMD. Ultimately, our model shows outstanding performance in the confusion matrix, with a significant reduction in misclassifications across all categories. The error rate for wetAMD classification is greatly reduced, and the confusion between PCV and wetAMD is nearly completely eliminated, validating the effectiveness and advantages of our multi-modal fusion framework in the AMD classification task.

Fig. 6.

Fig. 6

Comparison of performance metrics with 95% confidence intervals across different models. Significant differences in accuracy, AUC-ROC, and F1-score are observed, with the proposed multi-modal model consistently outperforming all baseline methods, demonstrating statistical significance and robustness

To further validate improvements in lesion localization accuracy and interpretability of our proposed multi-modal framework compared to single-modal models, we provide class activation map (CAM) [38] visualizations with annotated clinical ground truth (Fig. 7). Ground truth lesion regions, clinically identified by ophthalmologists, are clearly marked using dashed ellipses in the input images (Fig. 7a). Additionally, model predictions (correct or incorrect) are explicitly labeled beneath each CAM visualization, facilitating direct comparison.
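The CAM computation itself is simple: the final feature maps are weighted by the classifier weights of the target class, rectified, and normalized. A minimal sketch (shapes are illustrative, not the exact model's dimensions):

```python
import numpy as np

def class_activation_map(features, fc_weights, cls):
    """CAM for class `cls`: weight each spatial feature map by the classifier
    weight for that class, rectify, and normalize to [0, 1].

    features: (C, H, W) feature maps; fc_weights: (num_classes, C).
    """
    cam = np.tensordot(fc_weights[cls], features, axes=1)  # -> (H, W)
    cam = np.maximum(cam, 0.0)                             # keep positive evidence only
    return cam / max(float(cam.max()), 1e-8)
```

The resulting map is upsampled to the input resolution and rendered as the red/yellow/blue overlay shown in Fig. 7.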

Fig. 7.

Fig. 7

Examples of class activation map (CAM) visualizations for four retinal categories (normal, dryAMD, PCV, wetAMD). The input CFP and OCT images are shown in a, with clinically annotated AMD-related lesion regions marked by dashed ellipses. CAM visualizations from the single-modal models (CFP-RET and OCT-RET) and the proposed multi-modal fusion model are shown in b and c, respectively. Labels beneath each pair of images in a represent the clinical ground truth provided by ophthalmologists, while labels under each CAM visualization in b and c indicate model predictions (correct ✓ or incorrect ×). In the CAM results, red indicates high attention to critical lesion areas, yellow moderate attention, and blue low attention. CFP images are converted to grayscale to enhance visual comparison

In the normal category, all models appropriately show minimal attention in non-lesion areas. However, notable differences appear in lesion categories. Single-modal models (CFP-RET and OCT-RET, Fig. 7b) present dispersed and often inaccurate attention regions that do not fully correspond with clinically annotated lesions. For instance, in dryAMD and PCV categories, these models’ attention areas frequently deviate from the clinically marked regions, leading to misclassifications, particularly noticeable in complex lesion types such as wetAMD.

Conversely, our proposed multi-modal model (Fig. 7c) demonstrates significantly improved interpretability by precisely and comprehensively aligning activation regions with clinical annotations. This improvement arises from the effective fusion of complementary features from CFP and OCT modalities, allowing the model to accurately localize lesions, including subtle pathological features such as early retinal degeneration in dryAMD, specific subretinal lesions in PCV, and nuanced vascular leakage in wetAMD. Consequently, this leads to markedly reduced misclassification and substantially enhanced clinical reliability.

The Impact of Different Components on Performance

To assess the contributions of the LoRA and DCCA components in the multi-modal AMD classification task, this study analyzes the performance changes obtained by removing each component in turn. Based on the data in Table 1, we compare the performance of the full model (Ours) with that of the ablated versions. Specifically, removing the LoRA module (MM-RET-DCCA) results in a 2.1-percentage-point decrease in overall accuracy, suggesting that LoRA not only substantially reduces training resource requirements but also plays a crucial role in performance optimization. Removing the DCCA module (MM-RET-LoRA) results in a 4.7-percentage-point drop in accuracy, highlighting the key role of DCCA in capturing correlations between the modalities. The MM-RET model, with both LoRA and DCCA removed, shows the greatest decline (12.2 percentage points), underscoring the synergistic effect between the two modules.
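The percentage-point drops quoted above follow directly from the overall accuracies in Table 1:

```python
# Overall accuracies from Table 1 (mean over five runs).
acc = {"Ours": 0.949, "MM-RET-DCCA": 0.928, "MM-RET-LoRA": 0.902, "MM-RET": 0.827}

# Accuracy drop (in percentage points) relative to the full model.
drops = {m: round((acc["Ours"] - a) * 100, 1) for m, a in acc.items() if m != "Ours"}
# drops -> {"MM-RET-DCCA": 2.1, "MM-RET-LoRA": 4.7, "MM-RET": 12.2}
```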

The t-SNE visualization [39] results (Fig. 8) provide a clear illustration of the distribution differences of the ablation models in the feature space. The full model (Ours) clearly separates the four retinal categories (normal, dryAMD, PCV, wetAMD), with distinct clustering boundaries for each category. Notably, it achieves excellent discrimination between the wetAMD (yellow) and PCV (green) categories. After removing the LoRA or DCCA module, the inter-class distribution becomes increasingly mixed, with a noticeable increase in the overlap between the wetAMD and PCV categories. This trend is particularly evident in the MM-RET model with both LoRA and DCCA components fully ablated, where the class boundaries become blurred, indicating a substantial decline in the model’s ability to extract fine-grained features without these key components.

Fig. 8.

Fig. 8

t-distributed stochastic neighbor embedding (t-SNE) visualization for demonstrating feature space distribution differences between our model and various ablation models. Purple represents the “normal” class, blue represents the “dryAMD” class, green represents the “PCV” class, and yellow represents the “wetAMD” class. Our model generates more distinct category boundaries

The Influence of the Oversampling Strategy

To assess the role of oversampling in mitigating class imbalance, we compare our proposed model (“Ours”) trained with and without a controlled augmentation-based replication approach for classes with fewer samples (dry AMD and PCV). As shown in Table 2, enabling oversampling yields an F1-score of 0.948, AUC-ROC of 0.991, and accuracy of 0.949, whereas disabling it leads to lower values (0.916, 0.983, and 0.918, respectively). This noticeable performance drop without oversampling aligns with observations made by Li et al. [36], who attribute similar declines to the model’s tendency to overemphasize majority-class information, thereby underutilizing critical features from minority classes. By replicating minority-class samples with moderate augmentations, we effectively address this imbalance, allowing the model to learn more robust representations for each class. Furthermore, the consistent gains across F1-score, AUC-ROC, and accuracy suggest that oversampling not only alleviates the risk of biased decision boundaries but also promotes a more stable training process, ultimately enhancing the overall classification performance.

Table 2.

Performance comparison of the proposed model (“Ours”) with and without oversampling strategy

Oversampling? F1-score AUC-ROC Accuracy
Yes 0.948 0.991 0.949
No 0.916 0.983 0.918

Bold values indicate the best performance metrics in each column

Comparison to the State-of-the-Art

As highlighted in the "Introduction" section, only a limited number of previous works have focused on multi-modal AMD categorization, notably Yoo et al. [13] and Wang et al. [14]. Yoo et al. employed a conventional framework, using pre-trained CNNs for feature extraction followed by a random forest classifier. Wang et al. advanced this by proposing a two-stream CNN framework designed specifically for multi-modal feature fusion. To ensure a fair and direct comparison, we replicate both methods on the publicly available MMC-AMD dataset under identical experimental conditions.

Table 3 clearly shows that our proposed framework outperforms these state-of-the-art methods across multiple key pathological categories. Specifically, for dryAMD, PCV, and wetAMD, our model achieves significantly higher F1-scores (0.996, 0.897, and 0.899 respectively) compared to Yoo et al. (0.783, 0.648, and 0.736 respectively) and Wang et al. (0.929, 0.864, and 0.864 respectively). Furthermore, our method demonstrates a remarkable improvement in overall performance, achieving an overall F1-score of 0.948 and accuracy of 0.949, substantially higher than Yoo et al. (F1-score, 0.792; accuracy, 0.690) and Wang et al. (F1-score, 0.914; accuracy, 0.863). These results clearly underscore the superiority of our proposed approach.

Table 3.

Comparison with SOTA for multi-modal AMD categorization

Model F1 (normal) F1 (dryAMD) F1 (PCV) F1 (wetAMD) F1 (Overall) Accuracy
Yoo et al. [13] 1.000 0.783 0.648 0.736 0.792 0.690
Wang et al. [14] 1.000 0.929 0.864 0.864 0.914 0.863
Ours 0.999 0.996 0.897 0.899 0.948 0.949

Our proposed framework (integrating LoRA for efficient fine-tuning and DCCA for robust cross-modal feature fusion) is benchmarked against state-of-the-art approaches, including Yoo et al. [13]’s method (CNN-based feature extraction combined with a random forest classifier) and Wang et al. [14]’s two-stream CNN architecture, with performance evaluated across key pathological categories

Bold values indicate the best performance metrics in each column

Compared with Different Fine-Tuning Strategies

In this section, we thoroughly compare various fine-tuning strategies, including full parameter fine-tuning, layer freezing, and LoRA, to assess their effectiveness in the multi-modal AMD classification task. MM-RET-DCCA is used as the baseline model to systematically explore how these strategies impact the balance between model performance and computational efficiency. Table 4 provides a detailed summary of the number of training parameters, GPU memory usage, and model performance (including F1-score, AUC-ROC, and accuracy) for each strategy, while Figs. 9 and 10 offer further visual insights into the characteristics and effects of each approach.

Table 4.

Performance comparison of different fine-tuning strategies

Model Fine-tuning strategy Training parameters (M) GPU memory (GB) F1-score (%) AUC-ROC (%) Accuracy (%)
MM-RET-DCCA Full fine-tuning 609 20.21 92.55** 98.85** 92.75**
MM-RET-DCCA Freeze 5 layers 482 15.95 92.87** 98.92* 93.07**
MM-RET-DCCA Freeze 10 layers 356 14.20 93.66** 99.01* 94.16*
MM-RET-DCCA Freeze 15 layers 230 12.45 92.94** 98.97* 93.40*
MM-RET-DCCA Freeze 20 layers 104 10.80 92.32** 98.74** 92.58**
Ours LoRA 3 10.03 94.78 99.12 94.89

Based on the MM-RET-DCCA model, we sequentially apply full parameter fine-tuning, layer freezing, and LoRA, reporting the trainable-parameter count, GPU memory usage, and average performance metrics across five training and evaluation runs. Statistical significance between the proposed LoRA strategy and the other methods is evaluated using paired t-tests and indicated as follows: *p < 0.05, **p < 0.01

Bold values indicate the best performance metrics in each column

Fig. 9.

Fig. 9

Comparison of trained and untrained parameters across different fine-tuning strategies. Significant differences in the number of training parameters are observed, with LoRA significantly reducing the trained parameters through the low-rank adaptation method

Fig. 10.

Fig. 10

Trade-off between accuracy and GPU memory requirements across different fine-tuning strategies. The layer freezing strategy progressively increases the number of frozen layers, achieving the optimal balance between performance and resource consumption when 10 layers are frozen. In contrast, the LoRA strategy significantly reduces memory usage while maintaining optimal accuracy, showcasing its efficiency and exceptional performance under resource-constrained conditions

Full Parameter Fine-Tuning

In examining the impact of the full parameter fine-tuning strategy (frozen blocks = 0) on model performance, we found that the MM-RET-DCCA model in this case contains 609 M training parameters, resulting in a sharp increase in GPU memory usage to 20.21 GB. Although this strategy achieves 92.75% accuracy, 92.55% F1-score, and 98.85% AUC-ROC, indicating its effectiveness in global optimization, the large number of training parameters leads to a high risk of overfitting. Moreover, the feasibility of this strategy in practical applications is limited by both memory usage and computational complexity.

Layer Freezing Strategy

To highlight the balance between model parameter optimization and resource consumption, we systematically evaluated the effect of progressively increasing the number of frozen layers to reduce model complexity and resource demands. The experimental results show that when 5 layers are frozen, the number of training parameters decreases to 482 M, GPU memory usage reduces to 15.95 GB, and accuracy improves to 93.07%. Increasing the number of frozen layers to 10 further reduces the training parameters to 356 M and GPU memory usage to 14.20 GB. This configuration achieved an accuracy of 94.16%, an F1-score of 93.66%, and an AUC-ROC of 99.01%. By reducing the training parameters, this approach lowers the risk of overfitting while still preserving the effective representation of key features, thus achieving a balance between performance and resource efficiency.

However, when the number of frozen layers is increased to 15 and 20, the number of training parameters falls to 230 M and 104 M, and GPU memory usage decreases further to 12.45 GB and 10.80 GB, yet model accuracy drops to 93.40% with 15 frozen layers and 92.58% with 20. This trend indicates that excessive layer freezing restricts the model's ability to learn fine-grained features, which in turn degrades performance. Thus, freezing 10 layers achieves the optimal balance between performance and resource consumption.
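The roughly linear relation between frozen layers and trainable parameters can be illustrated with toy numbers; the per-block parameter count and head size below are assumed for illustration, not taken from the paper's exact architecture.

```python
def trainable_params_after_freezing(blocks_per_encoder, params_per_block_m, head_params_m, n_frozen):
    """Rough trainable-parameter count (in millions) when the first `n_frozen`
    transformer blocks of BOTH encoders (CFP and OCT) are frozen.

    All counts are illustrative toy numbers.
    """
    trainable_blocks = blocks_per_encoder - n_frozen
    return 2 * trainable_blocks * params_per_block_m + head_params_m
```

Each additional frozen block removes the same fixed chunk of parameters from both encoders, which matches the near-constant step between the rows of Table 4.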

LoRA Strategy

To further optimize training efficiency and address the performance decline observed with aggressive layer freezing, we introduce the LoRA method. LoRA reduces the number of trainable parameters to 3 M, a 99.51% reduction compared to full parameter fine-tuning, while lowering GPU memory usage to only 10.03 GB. This substantial reduction does not degrade the model's performance; on the contrary, it achieves the highest accuracy of 94.89%, an F1-score of 94.78%, and an AUC-ROC of 99.12%. Paired t-tests further confirm that the LoRA strategy yields statistically significant improvements over the other fine-tuning methods (all p < 0.05). These results suggest that LoRA optimizes the balance between training efficiency and model performance by focusing on the most relevant parameters while preserving the core of the pre-trained knowledge.
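The low-rank decomposition behind these savings can be sketched in a few lines of numpy. The rank, scaling, and dimensions below are illustrative defaults, not the study's exact configuration; the key properties are that the layer starts out identical to the frozen pre-trained layer (B is zero-initialized) and that only a small fraction of the weight matrix's parameters are trainable.

```python
import numpy as np

class LoRALinear:
    """A frozen linear weight W plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, w, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # frozen pre-trained weight, (d_out, d_in)
        self.a = rng.normal(0.0, 0.01, (r, w.shape[1]))   # trainable down-projection
        self.b = np.zeros((w.shape[0], r))                # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ (self.w + self.scale * self.b @ self.a).T

    def trainable_fraction(self):
        return (self.a.size + self.b.size) / self.w.size
```

For a 768×768 weight with rank 8, the update holds 2·8·768 ≈ 12 K trainable values against ~590 K frozen ones, which is how rank decomposition shrinks the trainable footprint by orders of magnitude.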

In summary, there are significant differences between the trained and untrained parameters across different fine-tuning strategies, as visually shown in Fig. 9. These differences further affect the performance of each fine-tuning strategy in the multi-modal AMD classification task. Although the full parameter fine-tuning strategy offers more comprehensive model optimization, its high resource consumption and overfitting risks limit its practicality in real-world applications. The layer freezing strategy balances model performance and resource consumption by controlling the number of training parameters, achieving the best results when freezing 10 layers. However, as the number of frozen layers increases, the model’s feature learning capacity becomes increasingly restricted, leading to a decrease in performance. In contrast, the LoRA technique introduced in this study significantly reduces the model’s computational requirements while preserving exceptional learning capability. Figure 10 visually demonstrates LoRA’s superior performance across various fine-tuning strategies, achieving optimal classification performance while maintaining low GPU memory usage, highlighting its immense potential in multi-modal AMD classification tasks.

Discussion

Despite significant progress in AMD classification using the multi-modal framework, several limitations still exist. Although the proposed integration of LoRA and DCCA effectively addresses critical challenges—namely, computational complexity and cross-modal feature heterogeneity—the current study primarily integrates existing individual methods. Therefore, future research could further advance the methodological innovation by exploring novel cross-modal fusion strategies, such as customized attention mechanisms or self-supervised learning explicitly tailored for multi-modal medical imaging scenarios.

Moreover, the annotation cost of multi-modal data remains high, limiting the scale and diversity of the datasets available for experiments. While the present dataset validates the core conclusions, the stability and robustness of the proposed model may vary when applied to external datasets obtained from different imaging devices. Thus, future work should focus on further validating the generalization capability and robustness of the model in clinical practice by extending its application to various hospitals and healthcare environments. For instance, prospective studies involving multi-center collaborations can evaluate model performance across diverse clinical workflows and imaging equipment, ensuring its robustness and applicability in real-world clinical settings. Additionally, domain-adaptive techniques may be explored to maintain model performance consistency across institutions.

In addition, our current framework performs classification based solely on a single OCT slice paired with a CFP image. However, clinical practice typically involves OCT volumetric data, where ophthalmologists analyze multiple key B-scans simultaneously for accurate diagnosis. To enhance clinical applicability and diagnostic consistency, future studies might consider investigating three-dimensional OCT data processing approaches, such as fusing multiple B-scan images and integrating specific scoring aggregation strategies.

Furthermore, while our framework demonstrates promising potential for cross-task generalization, its effectiveness across a broader spectrum of retinal diseases has not yet been fully assessed. Future studies are encouraged to validate the framework across multiple retinal conditions and disease states, thus broadening its clinical applicability. Validation of the proposed approach using independent clinical cohorts and prospective studies could also further solidify its practical relevance and reliability in real-world medical applications.

Conclusion

In this study, an efficient multi-modal deep learning framework is proposed that successfully extends a pre-trained single-modal retinal foundation model to the multi-modal AMD classification task. By integrating LoRA, computational complexity is significantly reduced during fine-tuning, achieving performance comparable to, or even surpassing, that of full-parameter fine-tuning while optimizing only a minimal number of parameters. The application of DCCA effectively addresses the heterogeneity between the CFP and OCT modalities, enabling efficient fusion of cross-modal features. Experimental results on the publicly available MMC-AMD dataset show exceptional performance, significantly outperforming existing single-modal and multi-modal baseline models, especially in the recognition of complex pathological categories.

Despite the promising results demonstrated by our proposed multi-modal framework, several potential challenges remain when translating this model into real-world clinical applications. For instance, variations in imaging quality across different devices, the necessity for simultaneous acquisition of multiple modalities, and seamless integration into existing clinical workflows could pose practical difficulties. To address these challenges, future research should focus on developing robust data preprocessing and calibration methods, as well as conducting prospective multi-center clinical trials, thereby ensuring the reliability and robustness of the model in diverse clinical environments. We anticipate that this study will stimulate further exploration and application of multi-modal deep learning in medical diagnosis, advancing the translation and implementation of related technologies in clinical practice.

Author Contribution

The author contributions are as follows. Baochen Zhen: conceptualization, formal analysis, investigation, methodology, software, validation, visualization, writing—original draft. Yongbin Qi: data curation, formal analysis, visualization. Zizhen Tang: formal analysis, visualization, writing—original draft. Chaoyong Liu: investigation, software, validation. Shilin Zhao: investigation, visualization. Yansuo Yu: conceptualization, funding acquisition, project administration, resources, supervision, writing—review and editing. Qiang Liu: funding acquisition, resources, writing—review and editing.

Funding

This work was supported by the fund of Beijing Municipal Education Commission (No. 22019821001) and the National College Students Innovation and Entrepreneurship Training Program (No. 2025J00105).

Data Availability

The data are available at https://forms.gle/jJT6H9N9CY34gFBWA.

Declarations

Ethics Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to Participate

Not applicable

Consent for Publication

Not applicable

Conflict of Interest

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Wong, W.L., Su, X., Li, X., Cheung, C.M.G., Klein, R., Cheng, C.-Y., Wong, T.Y.: Global prevalence of age-related macular degeneration and disease burden projection for 2020 and 2040: a systematic review and meta-analysis. The Lancet Global Health 2(2), 106–116 (2014) [DOI] [PubMed] [Google Scholar]
  • 2. DeWan, A., Liu, M., Hartman, S., Zhang, S.S.-M., Liu, D.T., Zhao, C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al.: HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 314(5801), 989–992 (2006)
  • 3. Wong, C.W., Yanagi, Y., Lee, W.-K., Ogura, Y., Yeo, I., Wong, T.Y., Cheung, C.M.G.: Age-related macular degeneration and polypoidal choroidal vasculopathy in Asians. Progress in Retinal and Eye Research 53, 107–139 (2016)
  • 4. Lee, J.E., Shin, J.P., Kim, H.W., Chang, W., Kim, Y.C., Lee, S.J., Chung, I.Y., Lee, J.E.: Efficacy of fixed-dosing aflibercept for treating polypoidal choroidal vasculopathy: 1-year results of the VAULT study. Graefe’s Archive for Clinical and Experimental Ophthalmology 255, 493–502 (2017)
  • 5. De Fauw, J., Ledsam, J.R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., Askham, H., Glorot, X., O’Donoghue, B., Visentin, D., et al.: Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9), 1342–1350 (2018)
  • 6. Burlina, P., Freund, D.E., Joshi, N., Wolfson, Y., Bressler, N.M.: Detection of age-related macular degeneration via deep learning. In: 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 184–188 (2016). IEEE
  • 7. Burlina, P.M., Joshi, N., Pekala, M., Pacheco, K.D., Freund, D.E., Bressler, N.M.: Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmology 135(11), 1170–1176 (2017)
  • 8. Grassmann, F., Mengelkamp, J., Brandl, C., Harsch, S., Zimmermann, M.E., Linkohr, B., Peters, A., Heid, I.M., Palm, C., Weber, B.H.: A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology 125(9), 1410–1420 (2018)
  • 9. Vannadil, N., Kokil, P.: Automated age-related macular degeneration diagnosis in retinal fundus images via ViT. In: International Conference on Machine Learning, Deep Learning and Computational Intelligence for Wireless Communication, pp. 271–280 (2012). Springer
  • 10. Lee, C.S., Baughman, D.M., Lee, A.Y.: Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmology Retina 1(4), 322–327 (2017)
  • 11. Karri, S.P.K., Chakraborty, D., Chatterjee, J.: Transfer learning based classification of optical coherence tomography images with diabetic macular edema and dry age-related macular degeneration. Biomedical Optics Express 8(2), 579–592 (2017)
  • 12. Treder, M., Lauermann, J.L., Eter, N.: Automated detection of exudative age-related macular degeneration in spectral domain optical coherence tomography using deep learning. Graefe’s Archive for Clinical and Experimental Ophthalmology 256, 259–265 (2018)
  • 13. Yoo, T.K., Choi, J.Y., Seo, J.G., Ramasubramanian, B., Selvaperumal, S., Kim, D.W.: The possibility of the combination of OCT and fundus images for improving the diagnostic accuracy of deep learning for age-related macular degeneration: a preliminary experiment. Medical & Biological Engineering & Computing 57, 677–687 (2019)
  • 14. Wang, W., Li, X., Xu, Z., Yu, W., Zhao, J., Ding, D., Chen, Y.: Learning two-stream CNN for multi-modal age-related macular degeneration categorization. IEEE Journal of Biomedical and Health Informatics 26(8), 4111–4122 (2022)
  • 15. Razzak, M.I., Naz, S., Zaib, A.: Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps: Automation of Decision Making, pp. 323–350 (2018)
  • 16. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  • 17. Zhou, Y., Chia, M.A., Wagner, S.K., Ayhan, M.S., Williamson, D.J., Struyven, R.R., Liu, T., Xu, M., Lozano, M.G., Woodward-Court, P., et al.: A foundation model for generalizable disease detection from retinal images. Nature 622(7981), 156–163 (2023)
  • 18. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • 19. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • 20. Andrew, G., Arora, R., Bilmes, J., Livescu, K.: Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255 (2013). PMLR
  • 21. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3/4), 321–377 (1936)
  • 22. Kanagasingam, Y., Bhuiyan, A., Abràmoff, M.D., Smith, R.T., Goldschmidt, L., Wong, T.Y.: Progress on retinal image analysis for age related macular degeneration. Progress in Retinal and Eye Research 38, 20–42 (2014)
  • 23. Burlina, P., Pacheco, K.D., Joshi, N., Freund, D.E., Bressler, N.M.: Comparing humans and deep learning performance for grading AMD: a study in using universal deep features and transfer learning for automated AMD analysis. Computers in Biology and Medicine 82, 80–86 (2017)
  • 24. Zhang, H., Qie, Y.: Applying deep learning to medical imaging: a review. Applied Sciences 13(18), 10521 (2023)
  • 25. Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., et al.: Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172(5), 1122–1131 (2018)
  • 26. Shen, Y., et al.: Artificial intelligence system for automated glaucoma detection via deep learning. Ophthalmology 127(11), 1625–1632 (2020)
  • 27. Playout, C., Duval, R., Boucher, M.C., Cheriet, F.: Focused attention in transformers for interpretable classification of retinal images. Medical Image Analysis 82, 102608 (2022)
  • 28. Li, X., Jia, M., Islam, M.T., Yu, L., Xing, L.: Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis. IEEE Transactions on Medical Imaging 39(12), 4023–4033 (2020)
  • 29. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
  • 30. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? Advances in Neural Information Processing Systems 27 (2014)
  • 31. Vrbančič, G., Podgorelec, V.: Transfer learning with adaptive fine-tuning. IEEE Access 8, 196197–196211 (2020)
  • 32. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28 (2015)
  • 33. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799 (2019). PMLR
  • 34. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • 35. Jintasuttisak, T., Intajag, S.: Color retinal image enhancement by Rayleigh contrast-limited adaptive histogram equalization. In: 2014 14th International Conference on Control, Automation and Systems (ICCAS 2014), pp. 692–697 (2014). IEEE
  • 36. Li, X., Zhou, Y., Wang, J., Lin, H., Zhao, J., Ding, D., Yu, W., Chen, Y.: Multi-modal multi-instance learning for retinal disease recognition. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2474–2482 (2021)
  • 37. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
  • 38. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
  • 39. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008)

Associated Data


Data Availability Statement

The data are available at https://forms.gle/jJT6H9N9CY34gFBWA.


Articles from Journal of Imaging Informatics in Medicine are provided here courtesy of Springer
