Communications Medicine. 2025 Dec 27;6:32. doi: 10.1038/s43856-025-01293-9

Compact vision language models enable efficient and interpretable optical coherence tomography through layer-specific multimodal learning

Tania Haghighi 1, Sina Gholami 1, Jared Todd Sokol 2, Aayush Biswas 1, Jennifer I Lim 3, Theodore Leng 2, Atalie C Thompson 4, Hamed Tabkhi 1, Minhaj Nur Alam 1
PMCID: PMC12816039  PMID: 41455821

Abstract

Background

Translating the intricate anatomical signatures of retinal disease from optical coherence tomography (OCT) B-scans into clear, accurate clinical narratives demands algorithms that seamlessly fuse visual features with domain expertise.

Methods

We curated a multimodal dataset of 40,000 OCT B-scans from public repositories and private clinical cohorts, each paired with expert-validated summaries spanning six conditions: diabetic macular edema, diabetic retinopathy, geographic atrophy, drusen, choroidal neovascularization, and healthy retina. We introduce LO-VLM, a compact (247M-parameter) vision-language model (VLM) that infuses anatomical guidance into both encoder and decoder for free-form summary generation and multiclass disease classification. Benchmarking against state-of-the-art RetinaVLM, LLaVA-Med, and a vision-only ViT model demonstrates superior performance.

Results

In a blinded evaluation by three board-certified retina specialists, LO-VLM narratives achieve a mean = 8.5 (standard deviation = 1.15) out of 10, compared to a mean = 5.5 (standard deviation = 1.13) for RetinaVLM (p < 0.0001). In quantitative evaluations, LO-VLM achieves an SBERT similarity of 80.3% and a BERTScore F1 of 71.5%, representing improvements of 8.2% and 28.8% over specialized VLM baselines. For disease classification, LO-VLM reaches 96% accuracy (F1 = 96%), outperforming ViT by 13% and exceeding medical VLM benchmarks by over 62% (p < 0.05).

Conclusions

By reconciling interpretability with computational efficiency, LO-VLM establishes a paradigm for efficient AI models in OCT interpretation.

Subject terms: Diagnostic markers, Tomography, Eye abnormalities, Vision disorders

Plain language summary

Eye diseases such as diabetic retinopathy and macular degeneration can cause vision loss if not detected early. Doctors use a type of imaging called optical coherence tomography (OCT) to examine the layers of the retina, but interpreting the images obtained can be time-consuming and requires specialist training. In this study, we developed LO-VLM, an artificial intelligence system that learns from both images and descriptions written by experts to enable analysis of OCT scans. Using 40,000 examples, it can accurately identify different retinal diseases and generate clear, human-readable summaries. LO-VLM achieves high accuracy and produces reports clinicians find useful. This technology could support faster, more consistent eye disease diagnosis and provide interpretations similar to those of specialists in routine and remote care settings.



Introduction

Medical imaging and artificial intelligence (AI) tools have become indispensable in ophthalmic diagnostics. Recent successes in machine learning (ML) and deep learning (DL), coupled with the advent of unimodal and multimodal foundational models, have demonstrated expert-level performance in tasks ranging from lesion detection to anatomical segmentation1–4. In the current clinical setting, optical coherence tomography (OCT)5 has become the definitive modality for non-invasive retinal imaging, providing cross-sectional views of retinal microstructures with micrometer-scale resolution6. By employing low-coherence interferometry, OCT enables visualization of individual retinal layers in vivo. These high-resolution images are indispensable for diagnosing and monitoring ocular pathologies, including age-related macular degeneration (AMD), diabetic retinopathy (DR), and glaucoma7–9. Despite high fidelity for retinal disease diagnosis, manual assessment of OCT volumes can be both labor-intensive and susceptible to variability between observers: two clinicians may produce divergent measurements of retinal thickness or disagree on the presence of subtle hyperreflective changes. Moreover, the unequal distribution of trained retina specialists in rural or resource-limited regions exacerbates diagnostic delays and potentially compromises patient outcomes. Automated tools that can pre-screen OCT scans, quantify relevant biomarkers, or prioritize cases for specialist review are therefore critical to optimizing clinical workflows and expanding access to timely eye care10–12.

To meet this demand, recent advances in DL have demonstrated considerable promise for OCT analysis. Convolutional neural networks (CNNs) have been employed to classify OCT volumes according to disease state, distinguishing, for example, between normal retinas and those exhibiting signs of diabetic macular edema (DME)13,14. CNNs have also been used to segment retinal layers in order to derive quantitative metrics such as central retinal thickness. These methods rely exclusively on image-based feature extraction15–18. Despite achieving high performance on benchmark datasets, most existing approaches produce only a categorical diagnosis or a delineation of anatomical boundaries.

Notably, the ability to generate textual descriptions of retinal abnormalities, similar to a narrative dictated by an expert clinician, offers several advantages. Natural language summaries render AI outputs more interpretable and actionable19–21. Text-based outputs can also be integrated directly into electronic medical record systems, streamlining documentation and reducing transcription errors. In regions with limited specialist availability, a model capable of producing reliable, human-readable interpretations may serve as a decision support tool for general practitioners or optometrists. Therefore, there has been growing interest in vision-language models (VLMs) pre-trained on extensive image-text corpora. By employing cross-modal pre-training strategies, these frameworks jointly optimize visual and linguistic encoders, aligning image features and textual semantics within a unified space. Such integrated representations can enhance diagnostic performance and facilitate the generation of concise, domain-specific reports that are readily interpretable by clinicians. In recent years, researchers have demonstrated the utility of VLMs in several medical domains, especially in radiological applications22,23. Building upon cross-modal pre-training strategies, ophthalmic VLMs such as RetinaVLM24 and VisionUnite3 have been developed to generate narrative interpretations of retinal images. However, the considerable computational burden and limited pathological scope of these models underscore the necessity for an OCT-specific ophthalmic VLM that combines computational efficiency with the capacity to characterize the full spectrum of retinal pathologies in OCT imaging.

In this project, we present Layer-wise OCT VLM (LO-VLM), a domain-specific VLM that employs layer-specific multimodal learning to achieve efficient and interpretable OCT analysis for retinal disease diagnosis. LO-VLM jointly trains an image encoder and a text decoder optimized for accurately classifying retinal pathologies and quantifying essential biomarkers at individual retinal layers, all within a lightweight, computationally efficient framework tailored to the unique characteristics of OCT imaging. We address the dual challenges of ophthalmic multimodal data scarcity and computational efficiency in automated OCT interpretation through three key contributions. First, creation of a comprehensive layer-specific OCT image-text dataset: we construct a dataset of 40,000 OCT image-text pairs featuring layer-specific anatomical summaries that enable detailed retinal structure analysis and pathological characterization. This dataset addresses the critical gap in paired multimodal data for OCT interpretation and provides a foundation for training VLMs in this specialized domain.

Second, development of a compact VLM for multi-class retinal disease diagnosis: we show that our compact VLM is capable of generating multi-class diagnoses by integrating both image and text inputs, rather than relying solely on image-based analysis during training. We achieve high diagnostic accuracy across several disease categories compared to an image-only model.

Third, generation of clinically interpretable layer-specific descriptions: we demonstrate that our adapted model produces clinically meaningful layer-specific anatomical descriptions that enhance interpretability and facilitate clinician validation of automated diagnosis. These descriptions can help physicians better understand subtle irregularities of the retina, thereby supporting them in both diagnosis and the composition of accurate clinical reports. With the recent demonstration of home OCT devices and the adoption of OCT imaging in primary care hospitals25,26, tools like LO-VLM also have utility in resource-constrained environments or point-of-care settings where an ophthalmologist is not present to provide a differential diagnosis with enhanced interpretability. Through extensive quantitative and qualitative evaluations via clinician feedback, we show that the generated summaries align closely with expert annotations and support transparent, human-readable reporting.

Central to LO-VLM is a layer-wise prompting strategy that injects explicit anatomical priors into the vision-language alignment stage. By forcing the model to attend to specific retinal layers, we (i) reduce semantic ambiguity and label noise, (ii) encourage fine-grained cross-modal alignment at clinically meaningful loci, and (iii) achieve comparable representational power with a 247M-parameter backbone instead of multi-billion-parameter generic/medical VLMs. This design choice explains LO-VLM’s superior data efficiency and interpretability beyond raw accuracy gains.

Methods

This study establishes the effectiveness of layer-specific multimodal training in LO-VLM, a compact 247M-parameter VLM for OCT interpretation. To enable anatomically grounded understanding of retinal scans, we curated a dataset of 40,000 expert-reviewed, layer-specific OCT image-text pairs sourced from diverse public and private clinical repositories. The LO-VLM architecture jointly optimizes a vision encoder and a text decoder. The model was trained for multimodal alignment to support both retinal disease classification and the generation of layer-specific clinical summaries. Comparative evaluations were performed against a ViT-Base baseline and leading OCT-specialized VLMs, including RetinaVLM24 and LLaVA-Med27. Additionally, focused ablation studies were conducted to quantify the contribution of multimodal training and dataset design to model interpretability and diagnostic performance.

Dataset curation

For ablation analyses, two modality-specific datasets were curated: one comprising 172,000 spectral-domain OCT B-scans and the other, EYE-lit, containing approximately 766,000 textual excerpts. The OCT repository was assembled from multiple public and private sources and encompasses both normal retinal anatomy and a broad spectrum of pathologic presentations. Details of the OCT datasets are presented in Table 1. The text-only corpus draws from two complementary sources: standard ophthalmology textbooks and peer-reviewed PubMed abstracts28. Textbook passages provide structured and didactic explanations of disease mechanisms and clinical terminology, while research articles contribute original findings, detailed study results, and emerging insights into imaging biomarkers and therapeutic strategies.

Table 1.

Overview of OCT datasets with source, description, and diagnostic categories

Dataset Description Type
Wake Forest dataset A collection of high-resolution OCT images from patients with GA, classified into central, non-central GA, and no GA, including images from multiple visits. Private
UIC dataset OCT images categorized into Control, Mild, Moderate, and Severe DR stages. Captured using an Optovue system with 3-mm, 6-mm, and 8-mm fields-of-view. Private
Waterloo dataset55 Over 500 high-resolution OCT images including Normal, Macular Hole, AMD, and Central Serous Retinopathy. Public
Srinivasan dataset36 High-resolution OCT scans from patients with DME and AMD. Public
OCTDL dataset56 More than 2000 high-resolution OCT images encompassing AMD, DME, Epiretinal Membrane, Retinal Artery Occlusion, Retinal Vein Occlusion, and VMI Disease. Public
Kermany dataset11 207,130 validated OCT images, categorized into CNV, DME, Drusen, and Normal. Public

The UIC and Wake Forest datasets were accessed retrospectively. Dr. Lim (UIC) and Dr. Thompson (Wake Forest) hold approved IRB exemptions to access the retrospective data.

Image-text pair dataset

The image-text pair dataset was curated by assembling 40,000 OCT images from three sources: 27,000 scans from the public Kermany et al. dataset11, 7200 scans from a private UIC collection, and 5700 scans from a Wake Forest cohort (Fig. 1).

Fig. 1. Overview of dataset pipeline and composition.

Fig. 1

In a, raw OCT B-scans are processed by an AI assistant to generate free-text descriptions of pathologies found in OCT volume data, which are then reviewed by a clinician and assigned to five retinal layers. b Illustrates the distribution of OCT scans by diagnostic category and by data source. The accompanying table summarizes the number of samples, public availability, and annotation status for each dataset. c Summarizes the word count statistics of the descriptions and presents a word cloud of the most frequently used terms, emphasizing key retinal features.

For each OCT B-scan, structured textual annotations were generated by prompting GPT-4o to describe each retinal layer29 (see Supplementary Methods); these outputs were then reviewed by a manual grader to correct overgeneralizations and anatomical inaccuracies and to enforce consistent terminology. From the full set of OCT-identifiable layers, five were selected: nerve fiber layer (NFL), ganglion cell layer (GCL), inner plexiform layer (IPL), outer plexiform layer (OPL), and photoreceptor inner/outer segments (IS/OS), with additional information on other layers included when needed. These five layers consistently exhibit key diagnostic markers, such as subretinal fluid at the IS/OS interface in choroidal neovascularization (CNV) and drusen deposits in the OPL, whereas layers beyond this set provide limited discriminative value for diagnosing the diseases in our datasets: Normal, Drusen, CNV, DME, DR, and geographic atrophy (GA). Each prompt concatenated the diagnostic label with these five layer-specific descriptors, ensuring that annotations explicitly referenced anatomically relevant regions. Subsequently, we split the dataset into training and testing subsets containing 39,000 and 1000 samples, respectively, while maintaining identical category proportions across both splits to preserve class balance. This rigorously structured, multimodal curation of OCT images and layer-aware text provided the foundation for multimodal training, allowing the model to learn both global diagnostic cues and localized, layer-specific features essential for accurate retinal disease characterization.
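As a concrete illustration of the class-balanced split described above, the following is a minimal sketch; the `records` structure and scikit-learn's stratified splitter are assumptions for illustration, not the released curation code:

```python
# Hedged sketch of the 39,000/1,000 class-balanced split; `records` is an
# assumed list of {"image": path, "summary": text, "label": diagnosis} dicts.
from sklearn.model_selection import train_test_split

labels = [r["label"] for r in records]
train_set, test_set = train_test_split(
    records,
    test_size=1000,      # 1,000 held-out samples, 39,000 for training
    stratify=labels,     # keep category proportions identical in both splits
    random_state=0,
)
```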

Model architecture and training

For the architecture design of LO-VLM, we adopted an encoder-decoder backbone derived from the BLIP30 (Bootstrapping Language-Image Pre-training) framework, which integrates a ViT encoder with a Transformer-based text encoder-decoder. In this setup, the ViT processes the input image as a grid of patch tokens to generate a high-dimensional visual embedding. Simultaneously, the text decoder produces language outputs conditioned on these visual features. During training, BLIP is optimized using three complementary objectives:

  • Image-text matching (ITM): A cross-modal transformer block incorporating bi-directional self-attention, cross-attention, and feed-forward layers is trained to classify whether a given image-text pair is semantically aligned.

  • Image-text contrastive (ITC): A dual encoder contrastive loss function draws matched image and text embeddings closer within a shared latent space while repelling mismatched pairs.

  • Image-grounded language modeling (LM): A causal language modeling objective trains the decoder to auto-regressively generate captions, token-by-token, based on the encoded visual representation.

This modular architecture supports both retrieval-based tasks (via ITM and ITC) and generative captioning (via LM), as illustrated in Fig. 2.
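For concreteness, a minimal PyTorch sketch of how these three objectives can be combined is shown below; the tensor shapes, equal loss weighting, and helper signature are illustrative assumptions, not the BLIP reference implementation:

```python
import torch
import torch.nn.functional as F

def blip_style_losses(img_emb, txt_emb, itm_logits, itm_labels,
                      lm_logits, lm_labels, temp=0.07):
    """Sketch of the three BLIP-style objectives (not the authors' exact code).

    img_emb, txt_emb : (B, D) L2-normalized [CLS] embeddings from the encoders
    itm_logits       : (B, 2) match/no-match logits from the cross-modal block
    lm_logits        : (B, T, V) decoder logits; lm_labels are next-token targets
    """
    # Image-text contrastive (ITC): symmetric InfoNCE over the batch
    sim = img_emb @ txt_emb.t() / temp
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # Image-text matching (ITM): binary classification of aligned vs. mismatched pairs
    itm = F.cross_entropy(itm_logits, itm_labels)

    # Image-grounded LM: causal next-token prediction conditioned on the image
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(),
                         ignore_index=-100)

    return itc + itm + lm
```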

Fig. 2.

Fig. 2

LO-VLM framework: visual embeddings extracted from OCT B-scans are jointly optimized with curated, layer specific text summaries via three complementary objectives.

We trained our VLM to translate OCT B-scans into anatomically precise, layer-aware retinal descriptions by using structured prompts that combined each scan’s diagnostic label (e.g., “CNV,” “Drusen”) with its five layer-specific descriptors (NFL, GCL, IPL, OPL, IS/OS). Each training instance comprises a grayscale OCT scan and a text prompt formatted as: “<LayerName>: clinical summary of observed pathology or normal anatomy.” This formulation injects anatomical priors into the learning process (Fig. 1). In each update, a given retinal layer-level segmented B-scan is first processed by a ViT encoder to obtain a dense visual embedding encapsulating both global retinal morphology and subtle layer-wise reflectivity patterns. This embedding is shared across the original training objectives to generate detailed, anatomically grounded summaries such as “IS/OS: intact, continuous” or “IPL: focal hyperreflectivity indicating edema.”
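A small sketch of this prompt format follows; the helper name and the example descriptors are hypothetical, chosen only to mirror the “<LayerName>: clinical summary” convention described above:

```python
# Hypothetical reconstruction of the layer-wise prompt format; the released
# dataset schema may differ.
LAYERS = ["NFL", "GCL", "IPL", "OPL", "IS/OS"]

def build_prompt(diagnosis: str, layer_findings: dict) -> str:
    lines = [f"Diagnosis: {diagnosis}"]
    for name in LAYERS:
        lines.append(f"{name}: {layer_findings[name]}")  # "<LayerName>: summary"
    return "\n".join(lines)

print(build_prompt("CNV", {
    "NFL": "intact",
    "GCL": "preserved",
    "IPL": "focal hyperreflectivity indicating edema",
    "OPL": "no drusenoid elevation",
    "IS/OS": "subretinal fluid disrupting the band",
}))
```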

The retina layer-text matching (RLTM) objective assesses whether each scan-summary pair is correctly aligned, with positive examples linking scans to their respective layer descriptions. This promotes fine-grained visual-textual alignment at the anatomical layer level. The retina-layer contrastive alignment (RLCA) objective further refines this alignment by drawing matched scan-summary embeddings closer in a shared latent space while separating mismatched pairs, thereby guiding the encoder toward anatomically salient features. Finally, the layer-conditioned language modeling (LCLM) objective conditions captioning on the visual representation of each segmented scan, encouraging the decoder to produce clinically precise, anatomically grounded summaries initiated by the corresponding layer identifier.

For training the LO-VLM model, we followed the hyperparameter configuration outlined in the original baseline framework, with targeted adjustments. A batch size of 32 was selected as the maximum capacity supported by our 49-GB GPU, and the model was trained for 50 epochs. For the captioning task, the maximum output length was set to 256 tokens, which was empirically sufficient to capture detailed, layer-wise retinal descriptions. The training dataset consisted of 39,000 OCT scan-summary pairs, with 1000 samples reserved for testing. This configuration enabled the development of a clinically grounded framework capable of producing anatomically precise summaries from OCT B-scans.
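Under these stated hyperparameters (batch size 32, 50 epochs, 256-token captions), the fine-tuning loop reduces to something like the sketch below; the Hugging Face checkpoint, learning rate, and data loader are assumptions, and the authoritative script is the repository cited under “Code availability”:

```python
# Minimal sketch of BLIP-style captioning fine-tuning under the stated
# hyperparameters; checkpoint name and learning rate are assumptions.
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(50):
    for images, prompts in loader:   # assumed DataLoader (batch_size=32) of
                                     # PIL B-scans and layer-wise prompt strings
        batch = processor(images=images, text=prompts, return_tensors="pt",
                          padding=True, truncation=True, max_length=256)
        out = model(**batch, labels=batch["input_ids"])  # image-grounded LM loss
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```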

Baseline comparisons

We began our evaluation by benchmarking medical and general-purpose VLMs on internal and external datasets. Specifically, we evaluated: (i) RetinaVLM, the state-of-the-art OCT-specialized summary generator; (ii) LLaVA-Med, a broadly pre-trained medical VLM with demonstrated OCT summarization skills (Fig. 3); (iii) PaLI-Gemma 2, a 3-billion-parameter general-purpose VLM; and (iv) GIT-large31, a compact general-purpose model. Following prior work24,27, RetinaVLM and LLaVA-Med were evaluated using their respective recommended prompt templates, as shown in Fig. 3. For consistency, we also employed the same diagnostic prompt used with LLaVA-Med when evaluating PaLI-Gemma, asking whether it detected any abnormalities in the OCT images. However, PaLI-Gemma only generated generic captions such as “OCT image of the left eye” without providing any structural or diagnostic information. GIT-large was even less effective, producing irrelevant captions such as “the lens is shown in the image” or “a close up of a tooth in a mouth” and failing to recognize the input as an OCT image. This structured comparison allowed us to untangle the contributions of domain specificity, in-domain training, and sheer model scale to automated OCT interpretation.

Fig. 3. Comparison of the outputs of LO-VLM, RetinaVLM, and LLaVA-Med on the same OCT B-scan.

Fig. 3

For these comparisons, LLaVA-Med was prompted with “###Human: What biomarkers or abnormalities, if any, are present in the provided OCT image? ###Assistant:”, and RetinaVLM was prompted with “Describe the OCT image focusing on biomarkers and abnormalities.”

To understand just how much textual grounding adds, we then trained a purely visual ViT-Base classifier on the same dataset and evaluated it on the six-class diagnostic task: CNV, DME, DR, GA, Drusen, and Normal cases. By directly contrasting its performance against the three VLMs, we could quantify the concrete benefit that pairing images with structured text brings to OCT classification.

Furthermore, as part of our ablation studies, we evaluated a general-purpose VLM, PaLI-Gemma 2 (3B parameters), which we trained on our OCT dataset to adapt it to the OCT domain32. In parallel, we systematically reduced the number of training examples for our model to assess data efficiency, analyzing the impact on clinical summary quality across varying data scales.

Vision only baseline: ViT-base training on OCT images

VLMs have recently gained attention for their ability to generate descriptive reports and support decision making based on visual data. To benchmark this paradigm against traditional image classification, we trained a standalone ViT-Base model solely on OCT scans (39,000 images for training and 1000 for evaluation), excluding any accompanying text information. Each B-scan was uniformly resized to 224 × 224 pixels, and the pretrained 768-dimensional classifier head was replaced with a linear layer projecting to our six disease categories. The network was optimized with cross-entropy loss using the Adam algorithm (learning rate = 1 × 10−5), a batch size of 32, and a 50-epoch schedule. This unimodal baseline establishes a clear benchmark for quantifying the added value of language grounding and cross-modal alignment in our LO-VLM system. ViT-Base was selected to mirror the vision encoder of our LO-VLM framework, ensuring architectural consistency so that any performance differences can be attributed solely to the addition of textual information. As a widely adopted transformer-based backbone, ViT-Base benefits from extensive pretraining on large-scale image corpora and has proven effective when trained on medical imaging tasks, including OCT classification9,33.
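The baseline, as described, amounts to the following sketch; the specific pretrained checkpoint is an assumption, while the head replacement, optimizer, and schedule follow the stated configuration:

```python
# Sketch of the vision-only baseline: ViT-Base with a new 6-way linear head,
# Adam at 1e-5, cross-entropy loss (checkpoint name assumed).
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=6,                    # CNV, DME, DR, GA, Drusen, Normal
    ignore_mismatched_sizes=True,    # swap in a fresh 768 -> 6 classifier head
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for pixel_values, targets in loader:  # assumed DataLoader of 224x224 B-scans
        logits = model(pixel_values=pixel_values).logits
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```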

Specialized and foundation VLMs

We compared our results to RetinaVLM, the only VLM specifically trained for OCT interpretation. This baseline employs an 8B-parameter Llama 3 backbone paired with a ResNet vision encoder and was originally trained on a proprietary OCT dataset that was subsequently extended using an LLM assistant; this augmented dataset was then used to train the model. Beyond this specialized baseline, we evaluated a medical model, LLaVA-Med (7B parameters). LLaVA-Med is a medical image-captioning model trained on diverse imaging modalities, including OCT. The trained “microsoft/llava-med-v1.5-mistral-7b” model was retrieved from the Hugging Face repository and its performance was assessed using the developer-recommended Mistral Instruct conversation template. In accordance with the original RetinaVLM publication, we then employed the patient-triage prompt that most closely aligns with RetinaVLM’s layer-wise summary (Fig. 3; more samples are provided in Fig. S1).

Token specific gradient saliency for anatomical mapping

We computed language-based saliency maps to highlight which parts of the retinal image were most influential for specific textual outputs of the VLM. We employed Grad-CAM, a gradient-based attribution technique commonly used to identify the regions of an image that drive the prediction of a classifier. To adapt this method to a generative VLM, we defined the target output as the sum of logits over the tokens corresponding to a chosen phrase. This reframes phrase generation as a classification-style problem, allowing Grad-CAM to identify image regions most responsible for the appearance of that phrase in the model’s report. Grad-CAM was then applied by backpropagating gradients of this scalar through the vision encoder’s intermediate features, producing heatmaps that localize the image regions most relevant to the model’s textual description (Fig. 4)34,35.
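A sketch of this token-specific Grad-CAM is given below; the hook placement, attribute path, and phrase token positions are assumptions for a BLIP-style model in Hugging Face transformers, and token-position alignment is simplified:

```python
# Hedged sketch: the scalar target is the sum of decoder logits for a chosen
# phrase's tokens; gradients are taken w.r.t. the ViT's patch features.
# `model`, `pixels`, and `report_ids` are assumed to be the trained VLM, a
# preprocessed B-scan batch, and the tokenized report, respectively.
import torch

feats, grads = {}, {}

def hook(_, __, output):
    hs = output[0] if isinstance(output, tuple) else output  # (B, N_patches, D)
    feats["v"] = hs
    hs.register_hook(lambda g: grads.update(v=g))

# Attach to the last vision-encoder block (attribute path assumed).
handle = model.vision_model.encoder.layers[-1].register_forward_hook(hook)

out = model(pixel_values=pixels, input_ids=report_ids)  # teacher-forced pass
phrase_pos = [5, 6]  # hypothetical positions of the phrase's tokens
# Logits at position t-1 predict token t; sum them into one scalar target.
score = sum(out.logits[0, t - 1, report_ids[0, t]] for t in phrase_pos)
score.backward()

weights = grads["v"].mean(dim=1, keepdim=True)        # channel importance
cam = torch.relu((weights * feats["v"]).sum(dim=-1))  # (B, N_patches) heatmap
handle.remove()
```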

Fig. 4.

Fig. 4

Regions contoured by LO-VLM for frequent terms in the clinical summary dataset.

These maps are intended as explanatory aids rather than as segmentation masks. Whereas segmentation models explicitly delineate biomarkers, the saliency maps provide qualitative insight into how textual descriptions are anchored in the image, offering a practical way to assess whether generated reports are grounded in clinically relevant evidence. Pretrained attention weights were also considered, but they did not yield meaningful explanations, underscoring the value of gradient-based saliency in providing interpretable and clinically useful visualizations.

Enhancing readability with EYE-Llama

In clinical practice and previous ophthalmology studies, narrative paragraphs have been widely adopted because they enhance readability and conform to established reporting conventions. Although LO-VLM’s raw layer-wise outputs exhibit superior diagnostic accuracy, their tabular format may be less accessible to clinicians. To integrate the diagnostic precision of LO-VLM with the narrative clarity of traditional reporting, we apply EYE-Llama, a domain-specialized, instruction-tuned Llama 2 model, as a post-processing module that reformats structured predictions into concise, clinically accurate paragraphs28. Given the structured layer descriptions and diagnostic labels produced by LO-VLM, we invoke EYE-Llama with the following prompt template:

“You are an ophthalmic AI assistant. Given the structured findings below, compose a concise, clinically accurate paragraph describing an OCT image suitable for a specialist audience.

{Retina layer-wise information}

please write a cohesive paragraph integrating these points without adding any new information.”
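In practice, this post-processing step can be invoked roughly as sketched below; the Hugging Face model identifier and generation settings are assumptions, with the template copied verbatim from above:

```python
# Sketch of the EYE-Llama rewriting call; the checkpoint id is an assumed
# placeholder, and `layer_findings` is LO-VLM's structured layer-wise output.
from transformers import pipeline

rewriter = pipeline("text-generation", model="QIAIUNCC/EYE-Llama")  # assumed id

template = (
    "You are an ophthalmic AI assistant. Given the structured findings below, "
    "compose a concise, clinically accurate paragraph describing an OCT image "
    "suitable for a specialist audience.\n\n{findings}\n\n"
    "please write a cohesive paragraph integrating these points without adding "
    "any new information."
)
paragraph = rewriter(template.format(findings=layer_findings),
                     max_new_tokens=256)[0]["generated_text"]
```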

To verify that this rewriting step did not introduce or omit information, we randomly audited 100 LO-VLM outputs after EYE-Llama reformatted them. A non-expert rater compared each paragraph to the original layer-wise list and found that 87 of 100 contained no added or removed clinical facts (95% CI: 0.79–0.93).

Transforming LO-VLM’s structured outputs into natural language summaries via EYE-Llama (Fig. 5) yields a cohesive narrative that improves readability, retains diagnostic precision through ophthalmology-specific terminology, and produces paragraph-formatted text ready for direct inclusion in clinical records, scholarly manuscripts, or patient education materials.

Fig. 5. EYE-Llama enhances readability by converting LO-VLM’s structured layer-wise outputs into coherent, clinically formatted summaries.

Fig. 5

The highlighted regions illustrate consistent findings between EYE-Llama and LO-VLM, with matching colors indicating corresponding interpretations.

Results

Our findings confirm that layer-specific multimodal training enables LO-VLM to achieve superior diagnostic performance and interpretability compared to existing VLMs. The model reached 96% disease classification accuracy, an improvement of 13% over the vision-only ViT-Base, while delivering state-of-the-art semantic similarity (SBERT = 0.803; BERT-F1 = 0.715). It received significantly higher clinician ratings (mean 8.49 ± 0.88 vs. 5.41 ± 0.89; P < 0.0001) and maintained >80% accuracy with only 248 training examples, underscoring both its interpretability and data efficiency.

Quantitative evaluation

We defined two distinct tasks to showcase the model's ability to provide retinal layer descriptions and disease classification.

Task 1—Retinal Layer Description: In Task 1, each model's ability to generate precise, layer-specific anatomical descriptions is assessed via a suite of automated language metrics reported in the upper panel of Table 2. The results demonstrate that LO-VLM produces markedly more faithful and clinically relevant textual summaries than prior VLM approaches.

Table 2.

Captioning performance for Task 1 (OCT layer description) across models, reported using SBERT similarity, BERTScore (Precision, Recall, F1), BLEU-4, Smoothed BLEU, ROUGE-L, and CIDEr

Model #Params SBERT sim. BERT Prec. BERT Rec. BERT F1 BLEU-4 Smooth BLEU ROUGE-L CIDEr
Task 1: Retina layer description
LLaVA-Med 7B 72.1% 46.0% 37.1% 41.1% 0.0 0.0 0.01 0.0
RetinaVLM 8B 71.4% 47.0% 39.4% 42.7% 0.0 0.0 0.02 0.0
LO-VLM (trained encoder-decoder) 247M 79.9% 70.1% 70.3% 70.8% 0.02 0.04 0.2 0.04
LO-VLM 247M 80.3% 72.0% 71.3% 71.5% 0.02 0.04 0.2 0.03
Retina layer description on AMD only (GA, Drusen, Normal)
LLaVA-Med 7B 69.4% 46.0% 37.2% 41.1% 0.0 0.0 0.01 0.0
RetinaVLM 8B 71.3% 46.6% 38.9% 42.3% 0.0 0.0 0.03 0.0
LO-VLM 247M 81.2% 71.3% 71.2% 71.7% 0.02 0.05 0.2 0.03
Retina layer description on external dataset36
LLaVA-Med 7B 75.1% 45.6% 33.2% 38.4% 0.0 0.0 0.01 0.0
RetinaVLM 8B 74.7% 43.4% 40.8% 41.9% 0.0 0.0 0.05 0.0
LO-VLM 247M 83.2% 66.6% 62.0% 64.0% 0.01 0.03 0.14 0.03

The table is divided into three sections: (1) overall captioning results across models, (2) AMD-specific evaluation (GA, Drusen, Normal), and (3) evaluation on an external dataset. All reported differences were found statistically significant using paired bootstrap resampling of per-image metric differences (N = 10; p < 0.05). Highest scores are shown in bold, and second highest are underlined.

Task 2—Disease Classification: Task 2 benchmarks two large medical VLMs and a vision-only ViT-Base model on retinal disease multi-class classification against LO-VLM and its variant (trained encoder-decoder). Table 3 details the classification metrics for each method.

Table 3.

Disease classification performance (Task 2) across vision-language and vision-only models, reported using accuracy, F1-score, precision, and recall

Model #Params Captioning Accuracy F1-score Precision Recall
Task 2: Disease classification
LLaVA-Med 7B Yes 17% 4% 2% 14.3%
RetinaVLM 8B Yes 16% 7% 6% 16%
ViT-Base 86M No 83.1% 82.9% 85.0% 83.1%
LO-VLM (trained encoder-decoder) 247M Yes 84.7% 72.8% 73.4% 72.6%
LO-VLM 247M Yes 96.0% 96.0% 96.2% 96.0%
Disease classification on AMD only (GA, Drusen, Normal)
LLaVA-Med 7B Yes 28.5% 11.8% 8.2% 21.4%
RetinaVLM 8B Yes 31.1% 32% 32% 31%
LO-VLM 247M Yes 94.9% 96.4% 98.1% 94.9%
Disease classification on external dataset36
LLaVA-Med 7B Yes 33.3% 13% 8% 25%
RetinaVLM 8B Yes 32% 7% 8% 14%
LO-VLM 247M Yes 80.0% 41.8% 45.2% 40.0%

The table is organized into three sections: (1) overall disease classification results, (2) AMD-specific evaluation across GA, Drusen, and Normal categories, and (3) external dataset evaluation. All reported differences were found statistically significant using paired bootstrap resampling of per-image metric differences (N = 10; p < 0.05). Highest scores are shown in bold, and second highest are underlined.

AMD-focused Analysis: In addition, we performed an AMD-focused analysis to conduct a fair comparison between our model and RetinaVLM (trained exclusively on AMD). We constructed a balanced test subset comprising Normal, Drusen (an early stage of AMD), and GA (a late stage of AMD) cases (166 samples per class) and re-evaluated all methods.

External Dataset Evaluation: To further assess the generalizability of the proposed models, a critical aspect of clinical AI applications, we conducted an external validation using the Srinivasan OCT dataset36, which includes images from three diagnostic categories: AMD, DME, and Normal. A balanced subset of 400 samples per class was selected for evaluation. Because our training dataset does not include a single unified AMD label, we treated predictions of GA or Drusen as correct when evaluating AMD cases, since these represent early and late manifestations of the same disease continuum, ensuring a fair cross-dataset comparison. The results of this evaluation are reported in Table 2 for Task 1 and Table 3 for Task 2.

Task evaluation metrics

To comprehensively evaluate model performance across all tasks, we employed a suite of language generation and classification metrics. For Task 1, we adopted both lexical and semantic similarity metrics to evaluate the alignment between generated outputs and reference descriptions. Prior to any metric computation, we applied a targeted cleaning procedure to both the model's outputs and the corresponding reference annotations. Specifically, we stripped out all retina-layer headings (fixed labels such as “Nerve Fiber Layer” and “Ganglion Cell Layer”) because these headings are structural markers rather than descriptive content. By removing them, we prevent evaluation metrics from being overly influenced by repeated, non-informative tokens that would artificially inflate scores. To capture semantic fidelity in the clinical domain, we employed SBERT37 similarity using the abhinand/MedEmbed-large-v0.1 model from Hugging Face, a Sentence-BERT model pre-trained on large-scale biomedical corpora38. Leveraging a medical-domain SBERT model enables more accurate embedding of domain-specific terminology and contextual relationships, thereby providing a more faithful assessment of semantic equivalence than general-purpose models.
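Concretely, the SBERT similarity computation reduces to the following sketch using the sentence-transformers library and the MedEmbed model named above; the paired generated/reference lists are assumed inputs:

```python
# Sketch of SBERT similarity with the medical-domain embedding model named
# in the text; `generated_texts` and `reference_texts` are paired lists.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("abhinand/MedEmbed-large-v0.1")
gen_emb = sbert.encode(generated_texts, convert_to_tensor=True)
ref_emb = sbert.encode(reference_texts, convert_to_tensor=True)

# Cosine similarity between each generated summary and its own reference
scores = util.cos_sim(gen_emb, ref_emb).diagonal()
print(f"mean SBERT similarity: {scores.mean().item():.3f}")
```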

We further report BERT-based precision, recall, and F1-score39, which measure contextual token-level alignment using deep transformer embeddings. These metrics are particularly useful in recognizing valid paraphrasing and terminological variation that traditional n-gram-based methods may overlook. To assess surface-level textual similarity, we also include BLEU40, Smooth BLEU41, ROUGE-L42, and CIDEr43 scores. BLEU and its smoothed variant evaluate n-gram precision, while ROUGE-L measures the longest common subsequence, capturing both content overlap and sequential alignment. Although CIDEr was originally developed for image captioning tasks with multiple references, we apply it here in a single-reference setting. In this case, CIDEr behaves similarly to BLEU but incorporates TF-IDF weighting to emphasize rare, informative n-grams. This weighting scheme provides a more nuanced assessment of content relevance, particularly beneficial in medical contexts where specific terminology carries high informational value.
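A minimal sketch of two of these metrics, BERTScore and smoothed BLEU, is shown below; the bert-score and NLTK packages are assumed library choices, not necessarily those used by the authors:

```python
# Sketch of contextual (BERTScore) and surface-level (smoothed BLEU) metrics;
# `generated_texts` and `reference_texts` are paired lists of strings.
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

P, R, F1 = bert_score(generated_texts, reference_texts, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([ref.split()], gen.split(), smoothing_function=smooth)
    for gen, ref in zip(generated_texts, reference_texts)
]
print(f"Smoothed BLEU: {sum(bleu) / len(bleu):.3f}")
```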

For Task 2 and AMD-focused analysis, we report accuracy, precision, recall, and F1-score, which are standard metrics for multi-class classification. These collectively assess the model’s overall correctness, its ability to identify true positive cases, its sensitivity to relevant instances, and the harmonic balance between precision and recall.

Qualitative evaluation

We randomly selected 100 OCT images from our held-out test set, preserving the same class distribution as in our training data, and conducted a blinded, expert grading study against RetinaVLM, the only existing OCT-specialized VLM. Three retina specialists independently reviewed the layer-specific descriptions and diagnostic predictions produced by each model. Each grader scored every response on a 0–10 scale (mean ± SD, median, and interquartile range reported in Table 4) according to three criteria: (i) Correctness: accuracy of the findings and diagnosis, identifying the right pathologies without errors; (ii) Completeness: inclusion of all relevant abnormalities, covering every key layer and feature; and (iii) Clarity: readability and coherence, using precise, logical language that is easy for specialists to follow.

Table 4.

Qualitative assessment of OCT VLMs based on physician ratings (0–10 Scale)

Model # Params Mean ± SD Score range (Min–Max) Median IQR # Samples
RetinaVLM 8B 5.54 ± 1.13 4–7 6 2 100
LO-VLM 247M 8.49 ± 1.15 7–10 9 2 100

The difference between models was statistically significant (paired t-test, P < 0.0001).

Ablation studies

Domain specific pre-training investigation

We investigated whether separate pre-training of the vision and text components on domain-specific but unpaired data could improve performance. The ViT encoder was pre-trained on 172,000 unlabeled OCT scans using a self-supervised masked image modeling (MIM) objective44. In MIM, a random subset of image patches is masked out during training and the network must reconstruct the missing regions from the visible context. This forces the model to learn richer, high-level features of retinal structure and layer boundaries. By predicting masked regions based solely on the surrounding pixels, the ViT encoder develops an implicit understanding of OCT-specific image statistics and anatomical priors. We randomly masked 60% of each OCT B-scan (preserving the top and bottom 20%) so that the model's reconstruction loss is driven almost entirely by the central region, where the stratified retinal layers (e.g., nerve fiber, inner plexiform, and photoreceptor layers) are most clearly delineated.
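One plausible reading of this masking scheme is sketched below for a ViT-Base/16 grid on 224 × 224 inputs: 60% of patches are masked, all drawn from the central band so the top and bottom 20% of rows stay visible. The grid size and sampling details are assumptions:

```python
# Hedged sketch of the central-band patch mask described above; grid size
# assumes ViT-Base/16 on 224x224 inputs (a 14x14 patch grid).
import torch

def central_band_mask(grid=14, mask_ratio=0.6):
    rows = torch.arange(grid)
    # Keep top and bottom 20% of rows visible; mask only the central band
    band = (rows >= int(0.2 * grid)) & (rows < int(0.8 * grid))
    candidates = torch.nonzero(
        band[:, None].expand(grid, grid).flatten()).squeeze(1)
    n_mask = int(mask_ratio * grid * grid)        # 60% of all patches
    chosen = candidates[torch.randperm(len(candidates))[:n_mask]]
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[chosen] = True
    return mask   # True = patch is masked and must be reconstructed
```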

In addition to the vision encoder, we pre-trained the text decoder45 on 765,000 ophthalmology text documents to capture domain-specific terminology and linguistic patterns relevant to retinal disease descriptions in the language component of the VLM. This approach aimed to leverage the substantial amounts of unimodal medical data available in the public domain to enhance model understanding of OCT imaging characteristics and ophthalmic terminology. Following independent pre-training, the vision and text components were integrated and trained on our paired dataset using identical training configurations as the baseline models (Fig. 6). Through evaluation of the results and comparison between LO-VLM (trained encoder-decoder) and LO-VLM (Tables 2 and 3), we observe that staged unimodal adaptation fails to enrich clinical feature representations or improve performance on either descriptive or classification tasks. Instead, LO-VLM (trained encoder-decoder) requires substantially more training time and large domain-specific datasets without delivering measurable gains. In contrast, direct end-to-end multimodal training of LO-VLM on paired OCT image-text examples consistently yields superior results, confirming that joint cross-modal optimization is both the most effective and the most efficient strategy for clinical OCT interpretation.

Fig. 6. Overview of the self-supervised and multimodal training pipeline for LO-VLM.

Fig. 6

a Self-supervised pre-training. Top: masked image modeling is applied to five unlabeled OCT datasets to train a Vision Transformer (ViT) backbone. Bottom: the Eye-lit corpus, comprising online articles, textbooks, and scientific abstracts, is used to train a BERT language model for domain-specific text representations. b Multimodal training. The pretrained ViT and BERT are jointly tuned on paired OCT image-text examples to produce the LO-VLM model, enabling unified visual and textual understanding of retinal scans.

Data efficiency analysis

We conducted systematic experiments to determine the minimum data requirements for effective OCT adaptation. Training datasets ranged from 100 to 39,000 image-text pairs; this comprehensive scaling analysis enables precise characterization of data efficiency and identifies optimal training set sizes for deployment scenarios where data availability may be constrained.

To establish comprehensive benchmarks for data efficiency, we evaluated BLIP and PaLI-Gemma, a larger-scale model trained via Low-Rank Adaptation (LoRA)46. For PaLI-Gemma, only 267 million of its 3 billion parameters were trained, by inserting low-rank decomposition matrices into the attention layers while freezing the remaining weights. This approach closely matches BLIP's parameter count to enable a fair comparison and to assess whether the larger model can better cope with limited training data46. Figure 7a, b illustrates the performance of models trained with dataset sizes ranging from 100 to 39,000 image-text pairs over 50 training epochs. As shown in Fig. 7a, models trained on fewer than 500 pairs display marked variability and frequent outliers, underscoring the difficulty of achieving stable image-text alignment from limited supervision. In contrast, increasing the number of training pairs results in progressive gains in accuracy and F1, accompanied by narrower performance distributions. Figure 7b demonstrates that SBERT similarity and BERT F1 stabilize rapidly when training with 248 pairs or more, with trajectories remaining largely unchanged after the initial epochs. By comparison, the 100-pair model exhibits unstable and fluctuating dynamics before partially converging, reflecting the greater sensitivity of low-resource training to optimization noise (more metrics are provided in Fig. S2). These findings indicate that robust cross-modal alignment in LO-VLM depends primarily on the availability of sufficient training data, with performance stabilizing even when trained on relatively small datasets.
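The LoRA configuration described above can be expressed with the peft library roughly as follows; the rank, alpha, and checkpoint identifier are assumptions chosen to illustrate an attention-only adapter budget on a 3B backbone:

```python
# Sketch of attention-only LoRA fine-tuning for PaLI-Gemma; rank/alpha and
# the checkpoint id are illustrative assumptions, not the authors' settings.
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-224")
config = LoraConfig(
    r=64, lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)   # base weights frozen, adapters trainable
model.print_trainable_parameters()     # verify the trainable-parameter budget
```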

Fig. 7. Evaluation of LO-VLM model performance.

Fig. 7

a Disease classification performance (F1 and Accuracy) across training steps. b Captioning quality (BERT-F1 and SBERT similarity) across different amounts of input data. c Performance of LO-VLM vs. PaLI-Gemma across sample sizes (248, 15,000, 39,000). In all panels, N in “LO-VLM-N” denotes the number of training samples used for each model configuration.

Discussion

Most existing VLMs are trained on general-purpose datasets such as MSCOCO and Flickr30k, which emphasize natural scenes and everyday objects and therefore lack the capacity to represent submicron retinal microstructures or to articulate pathological features in precise ophthalmic terminology. Although several medical VLMs have emerged, they are designed to span multiple imaging modalities rather than specialize in OCT. For instance, LLaVA-Med, a 7B-parameter medical VLM that includes OCT among its training modalities, yields only marginal classification accuracy (17% and 28%) and records an SBERT similarity of 72.1% alongside a BERTScore F1 of 41.1%. Recent work in ophthalmic VLMs has largely concentrated on fundus photography47–50, focusing on generating clinical summaries, answering diagnostic questions, and improving lesion interpretation from two-dimensional retinal photographs. Although existing methods have improved fundus photography analysis, OCT provides cross-sectional insight into retinal layers, revealing fine-grained morphological and pathological changes that are crucial for early identification of disease. RetinaVLM is currently the only model specifically designed for free-form OCT summary generation, incorporating an 8-billion-parameter backbone for AMD staging, referral support, and biomarker validation. However, it achieves only 16% accuracy on our six-class evaluation, reflecting its limited pathology diversity. To ensure a fair comparison, we restricted the test set to the two AMD-related categories, Drusen and GA, plus Normal cases, under which RetinaVLM’s accuracy improves to 32%. Nevertheless, it still falls significantly short of LO-VLM, which maintains approximately 96% F1 score on both the full six-class task and the AMD-only subset. Moreover, RetinaVLM’s generated summaries exhibit lower textual alignment, with an SBERT similarity of 71.4% and a BERTScore F1 of 42.7%, underscoring its limited generalizability, a common challenge among medical models. Other OCT-focused approaches include EyeClip51 and MM-Retinal52, which have been trained on paired OCT image-text data but do not support free-form clinical summary generation; notably, only GCS-M3VLT53 can analyze OCT scans and generate interpretations, but its code and pre-trained weights have been withheld, precluding independent replication. These findings underscore three core limitations of existing medical VLMs: their poor generalizability due to limited exposure to diverse retinal conditions during training, their inability to generate free-form OCT interpretations, and their reliance on excessively large parameter counts that demand substantial computational resources. These constraints emphasize the need for efficient, OCT-specific VLMs that can produce clinically meaningful outputs across a wide range of pathologies.

In this study, we present a comprehensive evaluation of LO-VLM through both quantitative benchmarks and qualitative expert assessment, demonstrating its strong performance across retinal image understanding tasks. For Task 1 (Retinal Layer Description) in our qualitative evaluation, a blinded review was performed in which three board-certified retina specialists independently assessed anonymized OCT summaries generated by LO-VLM and RetinaVLM. Each specialist scored 100 generated summaries using a standardized rubric, with scores averaged across graders for each sample and summarized. LO-VLM’s outputs received a mean score of 8.5 ± 1.15 compared to 5.4 ± 1.13 for RetinaVLM (p < 0.0001), underscoring clinicians’ strong preference for LO-VLM’s anatomically grounded narratives. In our quantitative evaluation, LO-VLM outperformed LLaVA-Med and RetinaVLM across all metrics, including SBERT, BERTScore, BLEU, Smooth BLEU, ROUGE-L, and CIDEr, in both the six-class setting and the AMD-only subset (see Table 2). For Task 2 (Disease Classification), our model again outperforms all baseline VLMs in identifying retinal diseases. In this setting, we also compare our model against its own vision-only component using a ViT-Base backbone with a classification head. In the six-class evaluation, LO-VLM achieved 96% accuracy (F1 = 96%), whereas the vision-only ViT-Base baseline reached 83.1% accuracy (F1 = 82.9%). This performance gain is attributed to anatomical priors that guide the model’s attention toward clinically salient regions, such as drusen within the OPL and subretinal fluid at the photoreceptor interface, thereby reducing misclassification of subtle pathological manifestations.

To examine cross-dataset performance, we evaluated LO-VLM using an external Srinivasan dataset, as medical models often face challenges in transferring knowledge across datasets. In this evaluation, LO-VLM’s classification accuracy decreased from 96% on our internal test set to 80%, reflecting the expected domain shift between datasets. Nevertheless, it substantially outperformed RetinaVLM (32%) and LLaVA-Med (33%), achieving markedly higher accuracy as well as superior language similarity metrics in anatomical description generation. Specifically, LO-VLM achieved an SBERT similarity of 83.2%, compared with 74.7% for RetinaVLM and 75.1% for LLaVA-Med. Although the external dataset primarily consists of AMD cases, which is the focus of RetinaVLM, LO-VLM maintained superior performance, underscoring its robust cross-domain adaptability and clinically meaningful generalization.

We further compared data efficiency between LO-VLM and the PaLI-Gemma model. When trained with only 248 examples, the 3B-parameter PaLI-Gemma model failed entirely at classification, achieving 0% across all classification metrics, and underperformed in captioning with a BERTScore F1 of just 46%. In contrast, LO-VLM achieved 83% across classification metrics and a BERTScore F1 of 72% under the same low-data regime. While PaLI-Gemma’s performance improved with more data, these results highlight the superior data efficiency of LO-VLM and the importance of task-specific inductive bias. Beyond these performance metrics, the pronounced data efficiency of LO-VLM offers further insight into its practical utility. Achieving near-peak classification and summary fidelity with only 248 paired examples suggests that anatomical prompts serve as strong guiding signals, effectively reducing the need for large volumes of training data. Figure 7c illustrates how LO-VLM’s accuracy and language similarity metrics rapidly approach saturation, with diminishing gains beyond 15,000 examples. This plateau informs annotation strategy: rather than scaling purely by quantity, future efforts should prioritize diversity, capturing rare pathologies, varied device settings, and demographic heterogeneity, to further bolster generalizability. Moreover, success in the low-data regime opens the door to cost-effective deployment in resource-limited clinics and for less common disease subtypes, where large annotated corpora are infeasible. Active learning techniques could be integrated to identify the most informative OCT-text pairs, optimizing the annotation budget by focusing on edge cases and under-represented conditions.

To achieve this performance, however, we faced two primary challenges in constructing an ophthalmic VLM for OCT interpretation. The first challenge is data scarcity. Large-scale paired OCT image and text corpora, where each OCT volume is linked to an expert-written description, are not publicly available. Extensive repositories of OCT images and ophthalmology literature exist independently, but the paucity of curated multimodal datasets constrains the development of models that can reason over both visual and linguistic modalities. The second challenge is that existing VLMs are large-scale, general-purpose models that lack the anatomical specificity needed for OCT interpretation and impose computational demands that make them impractical for clinical use. In this study, we address this gap by introducing a compact, OCT-specialized VLM with the following contributions: (i) anatomy-guided prompt integration, (ii) targeted adaptation that outperforms specialized VLMs, (iii) ablation of unimodal pre-training versus end-to-end alignment, (iv) data-efficiency scaling analysis, and (v) clinical validation with seamless report generation.

Leveraging structured, layer-specific anatomical cues, LO-VLM injects visual representations directly into the encoder and decoder attention layers to generate detailed, clinically oriented OCT summaries grounded in retinal layer visual features. This compact, domain-guided adaptation significantly outperforms much larger OCT-specific and general medical VLMs in producing coherent and accurate summaries, despite RetinaVLM and LLaVA-Med possessing billions of parameters. Through ablation experiments, we demonstrate that a unified, end-to-end cross-modal alignment on paired OCT image-text data consistently yields higher summary fidelity and interpretive clarity than a two-stage pipeline where independent vision and language pre-training introduces additional computational overhead without commensurate benefits (Tables 2 and 3).

Despite these advances, our study has several limitations. Our dataset, although extensive, originates from a limited number of institutions and may not capture the full variability of OCT devices, scanning protocols, or rare pathologies encountered in broader clinical practice. Moreover, our evaluation was confined to six disease categories and five retinal layers, leaving finer pathological subtypes and alternative OCT modalities such as OCT angiography or longitudinal volumetric analyses unexplored. Future work should integrate real-world clinical documentation and comprehensive patient metadata to enrich contextual understanding, and enlist a larger, more diverse cohort of human evaluators to rigorously assess generalizability and interpretability across varied clinical settings. Finally, extending the LO-VLM framework to incorporate complementary ophthalmic imaging modalities, including fundus photography and OCT angiography, would enable simultaneous assessment of structural and vascular biomarkers, thereby enhancing diagnostic and prognostic performance.

The development of automated OCT interpretation tools raises important ethical considerations. While this work is intended solely for research purposes and is not designed for clinical deployment, ensuring the protection of patient privacy remains essential even in research contexts. This necessitates rigorous de-identification protocols and strong data governance. Furthermore, when considering potential applications, it is crucial to emphasize that such systems are meant to assist clinical expertise, not replace it. Any future use would therefore require careful evaluation, appropriate training, and thorough oversight.

This work represents a step toward building clinically meaningful VLMs tailored to the unique demands of ophthalmic imaging. By focusing on anatomical grounding, interpretability, and efficiency, LO-VLM demonstrates how targeted model design can overcome the limitations of large-scale general-purpose architectures in specialized medical domains. As research in medical VLMs continues to evolve, bridging the gap between domain-specific insight and scalable learning frameworks will be key to advancing AI systems that are not only performant, but also clinically valuable, and data efficient.

Statistical analysis

For both text-generation (Table 2) and classification (Table 3) evaluations, statistical significance was assessed using two-sided paired bootstrapping. Per-image metric differences between models were resampled 10 times with replacement, and the resulting empirical distributions were used to estimate two-sided p-values. A significance threshold of p < 0.05 was applied to determine meaningful differences across models. All models were evaluated on the same held-out images to ensure a like-for-like comparison.
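A sketch of this paired bootstrap is given below, transcribing the stated N = 10 resamples directly (larger resample counts are more typical; the helper is illustrative):

```python
# Hedged sketch of a two-sided paired bootstrap over per-image metric
# differences; `metric_a` and `metric_b` are per-image scores for two models
# evaluated on the same held-out images.
import numpy as np

def paired_bootstrap_pvalue(metric_a, metric_b, n_resamples=10, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.asarray(metric_a) - np.asarray(metric_b)  # per-image differences
    boot = np.asarray([
        rng.choice(diffs, size=len(diffs), replace=True).mean()
        for _ in range(n_resamples)
    ])
    # Two-sided p-value: how often the resampled mean difference crosses zero
    return 2 * min((boot <= 0).mean(), (boot >= 0).mean())
```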

Human evaluation

For human evaluation (#samples = 100 OCT images; 3 graders), each grader scored every generated summary using the same rubric. Scores were averaged across graders for each sample, and these per-sample averages were summarized as mean ± SD across the 100 clinical summaries.

Hallucination audit

We sampled 100 LO-VLM → EYE-Llama narratives and manually checked whether any facts were added or removed during paragraph reformatting. In 87 of 100 cases, no factual changes were observed (exact binomial 95% CI: 0.79–0.93).
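The reported interval matches a Clopper-Pearson exact binomial confidence interval, which can be reproduced as follows (statsmodels is an assumed library choice):

```python
# Exact (Clopper-Pearson) binomial CI for the 87/100 audit result.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=87, nobs=100, alpha=0.05, method="beta")
print(f"95% CI: {low:.2f}-{high:.2f}")   # approximately 0.79-0.93
```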

Software

Analyses were performed in Python (PyTorch, Transformers, NumPy/pandas, SciPy, Matplotlib, and seaborn libraries).

Supplementary information

Acknowledgements

This work was supported by NEI R15EY035804, R21EY035271 (M.N.A.), and UNC Charlotte Faculty Research Grant (M.N.A.).

Author contributions

T.H. and M.N.A. conceived and designed the study; H.T. and M.N.A. supervised the study. T.H. and S.G. curated and annotated the dataset; J.T.S. and T.L. evaluated the results of the model; J.I.L. and A.C.T. provided clinical expertise and reviewed annotations; T.H. trained and developed the LO-VLM model; T.H. and S.G. analyzed results; T.H. drafted the manuscript; all authors reviewed and edited the manuscript.

Peer review

Peer review information

Communications Medicine thanks Kohilan Gananandan and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

The datasets generated and/or analyzed during the current study are partially available. The curated layer-specific OCT image-text dataset comprises 40,000 paired B-scans and textual descriptions collected from three sources: 27,000 scans from the public Kermany dataset, 7200 scans from the institutional UIC collection, and 5700 scans from the Wake Forest cohort. The public portion of the dataset (27,000 image-text pairs) is available in the Hugging Face repository at https://huggingface.co/datasets/QIAIUNCC/OCT-summary-Dataset54. Public OCT repositories referenced in this study, such as Kermany et al.11, OCTID55, Srinivasan et al.36, and OCTDL56 are available at their original archives. The numerical source data underlying the charts in Fig. 7 of the main article and Fig. S2 are provided in the Supplementary Data.

Code availability

All custom code used for data preprocessing, model training, and evaluation is publicly available at github.com/QIAIUNCC/LO-VLM_Layer-wise_OCT_Vision_Language_Model and archived at 10.5281/zenodo.1738142057.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at https://doi.org/10.1038/s43856-025-01293-9.

References

  • 1. Gholami, S. et al. Distributed training of foundation models for ophthalmic diagnosis. Commun. Eng. 4, 6 (2025).
  • 2. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
  • 3. Li, Z. et al. VisionUnite: a vision–language foundation model for ophthalmology enhanced with clinical knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 47, 11848–11862 (2025).
  • 4. Wang, C., Chen, X., Ning, H. & Li, S. SAM-OCTA: a fine-tuning strategy for applying foundation models to OCTA image segmentation tasks. In Proc. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1771–1775 (IEEE, 2024).
  • 5. Huang, D. et al. Optical coherence tomography. Science 254, 1178–1181 (1991).
  • 6. Keane, P. A. & Sadda, S. R. Retinal imaging in the twenty-first century: state of the art and future directions. Ophthalmology 121, 2489–2500 (2014).
  • 7. Drexler, W. & Fujimoto, J. G. State-of-the-art retinal optical coherence tomography. Prog. Retin. Eye Res. 27, 45–88 (2008).
  • 8. Hee, M. R. et al. Optical coherence tomography of age-related macular degeneration and choroidal neovascularization. Ophthalmology 103, 1260–1270 (1996).
  • 9. Shin, H., Jeon, S., Seol, Y., Kim, S. & Kang, D. Vision transformer approach for classification of Alzheimer’s disease using 18F-florbetaben brain images. Appl. Sci. 13, 3453 (2023).
  • 10. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
  • 11. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131 (2018).
  • 12. Subhedar, J. & Mahajan, A. A review on recent work on OCT image classification for disease detection. In Proc. 2022 OPJU International Technology Conference on Emerging Technologies for Sustainable Development (OTCON) 1–6 (IEEE, 2023).
  • 13. Schlegl, T. et al. Fully automated detection and quantification of macular fluid in OCT using deep learning. Ophthalmology 125, 549–558 (2018).
  • 14. Venhuizen, F. G. et al. Automated staging of age-related macular degeneration using optical coherence tomography. Investig. Ophthalmol. Vis. Sci. 58, 2318–2328 (2017).
  • 15. Roy, A. G. et al. ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomed. Opt. Express 8, 3627–3642 (2017).
  • 16. Li, Q. et al. DeepRetina: layer segmentation of retina in OCT images using deep learning. Transl. Vis. Sci. Technol. 9, 61 (2020).
  • 17. Apostolopoulos, S., De Zanet, S., Ciller, C., Wolf, S. & Sznitman, R. Pathological OCT retinal layer segmentation using branch residual U-shape networks. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention 294–301 (Springer, 2017).
  • 18. He, Y. et al. Topology guaranteed segmentation of the human retina from OCT using convolutional neural networks. Preprint at https://doi.org/10.48550/arXiv.1803.05120 (2018).
  • 19. Jing, B., Xie, P. & Xing, E. On the automatic generation of medical imaging reports. Proc. 56th Annu. Meet. Assoc. Comput. Linguist. 1, 2577–2586 (2018).
  • 20. Li, M. et al. Cross-modal clinical graph transformer for ophthalmic report generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 20656–20665 (IEEE, 2022).
  • 21. Huang, J.-H. et al. DeepOpht: medical report generation for retinal images via deep models and visual explanation. In Proc. Winter Conference on Applications of Computer Vision 2442–2452 (IEEE, 2021).
  • 22. Limbu, M. & Banerjee, D. MedBLIP: fine-tuning BLIP for medical image captioning. Preprint at https://doi.org/10.48550/arXiv.2505.14726 (2025).
  • 23. Tanno, R. et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat. Med. 31, 599–608 (2025).
  • 24. Holland, R. et al. Specialized curricula for training vision–language models in retinal image analysis. NPJ Digit. Med. 8, 532 (2025).
  • 25. Schneider, E. W. et al. Pivotal trial toward effectiveness of self-administered OCT in neovascular age-related macular degeneration. Report 2: artificial intelligence analytics. Ophthalmol. Sci. 5, 100662 (2025).
  • 26. Liu, X. et al. Evaluation of an OCT-AI-based telemedicine platform for retinal disease screening and referral in a primary care setting. Transl. Vis. Sci. Technol. 11, 4 (2022).
  • 27. Li, C. et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 36, 28541–28564 (2023).
  • 28. Haghighi, T. et al. EYE-Llama: an in-domain large language model for ophthalmology. iScience 28 (2025).
  • 29. OpenAI. GPT-4o system card (2024). Accessed 17 July 2025.
  • 30. Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proc. International Conference on Machine Learning 12888–12900 (PMLR, 2022).
  • 31. Wang, J. et al. GIT: a generative image-to-text transformer for vision and language. Preprint at https://doi.org/10.48550/arXiv.2205.14100 (2022).
  • 32. Steiner, A. et al. PaliGemma 2: a family of versatile VLMs for transfer. Preprint at https://doi.org/10.48550/arXiv.2412.03555 (2024).
  • 33. Akça, S., Garip, Z., Ekinci, E. & Atban, F. Automated classification of choroidal neovascularization, diabetic macular edema, and drusen from retinal OCT images using vision transformers: a comparative study. Lasers Med. Sci. 39, 140 (2024).
  • 34. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://doi.org/10.48550/arXiv.1312.6034 (2013).
  • 35. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (IEEE, 2017).
  • 36. Srinivasan, P. P. et al. Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images. Biomed. Opt. Express 5, 3568–3577 (2014).
  • 37. Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. Preprint at https://doi.org/10.48550/arXiv.1908.10084 (2019).
  • 38. Chakraborty, A. MedEmbed: large biomedical embeddings for semantic search. https://huggingface.co/abhinand/MedEmbed-large-v0.1 (2024).
  • 39. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. Preprint at https://doi.org/10.48550/arXiv.1904.09675 (2019).
  • 40. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (ACM, 2002).
  • 41. Chen, B. & Cherry, C. A systematic comparison of smoothing techniques for sentence-level BLEU. In Proc. 9th Workshop on Statistical Machine Translation 362–367 (2014).
  • 42. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (2004).
  • 43. Vedantam, R., Lawrence Zitnick, C. & Parikh, D. CIDEr: consensus-based image description evaluation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 4566–4575 (IEEE, 2015).
  • 44. He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16000–16009 (IEEE, 2022).
  • 45. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (2019).
  • 46. Hu, E. J. et al. LoRA: low-rank adaptation of large language models. ICLR 1, 3 (2022).
  • 47. Silva-Rodriguez, J., Chakor, H., Kobbi, R., Dolz, J. & Ayed, I. B. A foundation language-image model of the retina (FLAIR): encoding expert knowledge in text supervision. Med. Image Anal. 99, 103357 (2025).
  • 48. Chen, X. et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digit. Med. 7, 111 (2024).
  • 49. Shao, A. et al. Generative artificial intelligence for fundus fluorescein angiography interpretation and human expert evaluation. npj Digit. Med. 8, 396 (2025).
  • 50. Wang, M. et al. Enhancing diagnostic accuracy in rare and common fundus diseases with a knowledge-rich vision-language model. Nat. Commun. 16, 5528 (2025).
  • 51. Shi, D. et al. EyeCLIP: a visual-language foundation model for multi-modal ophthalmic image analysis. Preprint at https://doi.org/10.48550/arXiv.2409.06644 (2024).
  • 52. Wu, R. et al. MM-Retinal V2: transfer an elite knowledge spark into fundus vision-language pretraining. Preprint at https://doi.org/10.48550/arXiv.2501.15798 (2025).
  • 53. Cherukuri, T. K., Shaik, N. S., Bodapati, J. D. & Ye, D. H. GCS-M3VLT: guided context self-attention based multi-modal medical vision language transformer for retinal image captioning. In Proc. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, 2025).
  • 54. QIAIUNCC. OCT-summary-Dataset. https://huggingface.co/datasets/QIAIUNCC/OCT-summary-Dataset (2025).
  • 55. Gholami, P., Roy, P., Parthasarathy, M. K. & Lakshminarayanan, V. OCTID: optical coherence tomography image database. Comput. Electr. Eng. 81, 106532 (2020).
  • 56. Kulyabin, M. et al. OCTDL: optical coherence tomography dataset for image-based deep learning methods. Sci. Data 11, 365 (2024).
  • 57. Haghighi, T. et al. LO-VLM: layer-wise OCT VLM GitHub repository. https://github.com/QIAIUNCC/LO-VLM (2025).
