Abstract
Optical Coherence Tomography (OCT) is essential in ophthalmology for cross-sectional imaging of the retina. Pretrained foundation models facilitate task-specific model development by enabling fine-tuning with limited labeled data. However, current foundation models rely on a single B-scan (usually the central slice), overlooking volumetric context. This research investigates video foundation models to capture full 3D retinal structure and improve diagnostic performance. V-JEPA, a state-of-the-art video foundation model, was benchmarked against retinal foundation models (RETFound, VisionFM) and a natural image foundation model (DINOv2). All were fine-tuned to detect Age-related Macular Degeneration or Glaucomatous Optic Neuropathy using five OCT datasets. V-JEPA consistently equaled or outperformed image-based models, achieving an average AUROC of 0.94 (0.80–0.99), versus 0.90 (0.76–0.98) for the best image model, a statistically significant improvement (p < 0.001). To our knowledge, this is the first application of transformer-based video models to volumetric OCT, highlighting their promise in 3D medical imaging.
Subject terms: Eye diseases, Diagnostic markers
Introduction
Vision impairment is a major global health problem, affecting an estimated 2.2 billion people in 2019, with at least 1 billion cases potentially preventable through early detection and timely treatment1. As populations age, the need for mass screening and automated diagnostics continues to increase. AI-driven solutions offer a promising approach to address critical ophthalmic challenges. Among retinal imaging modalities analyzable by AI, Optical Coherence Tomography (OCT) scans are among the most promising due to the rich structural information they provide. OCT is a noninvasive imaging technique used in ophthalmology to obtain high-resolution cross-sectional images of the retina. An OCT scan consists of a 3D volumetric representation of the retinal layers (Fig. 1b1) and is made up of multiple B-scans, which are 2D cross-sectional slices (Fig. 2b2).
Fig. 1. Overview of the research.
a Geographic distribution of the datasets used in the experiments. b Comparison between video and image foundation models for OCT analysis. c Key results: Left: Performance benchmark of video, general image, and retinal image foundation models. Middle: Model performance as a function of training set size. Right: Visualization of attention maps from the video foundation model across the full OCT volume. BioRender.com was used to produce this figure.
Fig. 2. Results.
a Benchmark of state-of-the-art transformer-based image (DINOv2, RETFound, VisionFM) and video (V-JEPA) models on the five datasets. Each bar represents the mean value and error bars indicate the minimum and maximum AUROC computed for each test fold. P-values from pairwise DeLong tests comparing each model to V-JEPA, corrected using the Holm-Bonferroni method, are shown above the bars. b Performance as a function of training set size. This experiment is performed on the largest dataset, namely CirrusOCT.
Deep learning has become a reliable and practical tool for the automated identification of ophthalmic and systemic diseases from retinal scans2–7. Among existing machine learning methods, self-supervised learning (SSL) has emerged as a powerful approach to pre-train Vision Transformers (ViTs) on large-scale image datasets, yielding strong feature representations that generalize across diverse domains. SSL enables models to extract meaningful latent representations from unlabeled data, bypassing the need for explicit annotations. This approach leverages vast quantities of unannotated images to create foundation models pretrained on large datasets, which can then be efficiently fine-tuned for downstream tasks with limited labeled data. In medical imaging, SSL-pretrained models have demonstrated significant performance gains in diagnostic applications8. In retinal image analysis, foundation models pretrained in-domain on ophthalmic data have demonstrated improved performance across various tasks compared to conventional deep learning models pretrained from scratch9–11. These foundation models for OCT scan analysis typically train and perform inference using a single OCT B-scan, usually the middle one, based on the assumption that it contains the most clinically relevant information while disregarding the rest of the volume. However, this approach diverges from standard ophthalmological practice for the diagnosis of ophthalmological pathologies. In the case of AMD, clinicians visually inspect the OCT volume for Drusen and fluid. To detect early stages of glaucoma, they typically assess retinal layers automatically segmented by OCT software and compared with normative databases. In both cases, they inherently leverage the structural information embedded within the entire OCT volume, rather than relying on a single B-scan as a localized snapshot. 
We suggest that training a foundation model on the full OCT volume can further close the performance gap that has so far prevented most models from translating to clinical care, by capturing both B-scan features and slice interactions during the pretraining phase. Consequently, we propose to evaluate the value of using video foundation models for modeling OCT scans and improve the diagnosis of major ophthalmic diseases.
Results
To evaluate the benefits of modeling the entire OCT volume, we compare the V-JEPA video foundation model12 against two recent retinal foundation models, RETFound (the version pretrained on OCT) and VisionFM9,10, as well as a state-of-the-art foundation model trained on natural images, DINOv213. These four foundation models rely on the same ViT-L architecture (except for VisionFM, which is based on a ViT-B backbone, conceptually similar but with fewer parameters), so that any performance differences can be attributed to the proposed method rather than to major architectural variation. Our research focuses on foundation models because their pretraining enables them to reach high performance on a given task using a relatively small number of labeled OCTs for fine-tuning. This enables broader applications across different clinical settings. For all baseline models, only the central B-scan from each OCT volume is used, consistent with standard practice in 2D-based OCT analysis9,10,14,15. The models were evaluated on five independent OCT datasets from four geographic regions (Table 1), four publicly available and one newly curated clinical dataset, with diagnoses of Glaucomatous Optic Neuropathy (GON) or Age-related Macular Degeneration (AMD).
Table 1.
Datasets used for the evaluation on downstream tasks
| Task | Dataset | # Patients | # Scans | Prev. (%) | Sex (%) | Age | Geography | OCT device | Resolution (F×H×W) |
|---|---|---|---|---|---|---|---|---|---|
| GON | CirrusOCT18 | 624 | 1110 | 76 | 54 | 22–94 | US | Cirrus SD-OCT Scanner | 64 × 128 × 64 |
| GON | GAMMA19 | 276 | 300 | 50 | 42 | 19–77 | China | Topcon DRI OCT Triton | 256 × 992 × 512 |
| GON | HYMC | 210 | 798 | 33 | 50 | 36–95 | Israel | Topcon DRI OCT Triton | 256 × 992 × 512 |
| AMD | A2A SD-OCT16,17 | 384 | 384 | 70 | - | 50–85 | US | Bioptigen SD-OCT | 100 × 512 × 1000 |
| AMD | NEH27 | 441 | 553 | 66 | - | ≥50 | Iran | Heidelberg SD-OCT | 31 × 496 × 768 |
The two downstream tasks are detection of GON (glaucomatous optic neuropathy) and AMD (age-related macular degeneration). Prev.: the prevalence of the downstream task condition. Sex: percentage of female patients. Resolution (F × H × W) denotes the resolution of the OCT scans, where F represents the number of Frames, H represents the Height, and W represents the Width. Sex and Age statistics for each dataset are taken from the corresponding publications.
Image foundation models
Among the retinal foundation models, VisionFM outperformed RETFound, achieving average AUROC scores across all datasets of 0.90 and 0.87, respectively. DINOv2, a foundation model trained on natural images, surpassed RETFound on four of five datasets and had similar performance on the fifth. DINOv2 was on par with VisionFM, with an average AUROC of 0.90 across all datasets. The performance summary of the models is presented in Table 2.
Table 2.
Results of the four models across the different datasets
| Task / dataset, AUROC [min–max over 5 folds] | DINOv2 | RETFound | VisionFM | V-JEPA |
|---|---|---|---|---|
| GON identification | | | | |
| CirrusOCT (n = 1110) | 0.87 [0.82–0.91] | 0.79 [0.71–0.84] | 0.88 [0.85–0.92] | 0.94⊗ [0.92–0.96] |
| GAMMA (n = 300) | 0.94 | 0.95 | 0.98 | 0.99 |
| HYMC (n = 798) | 0.74 [0.69–0.78] | 0.73 [0.68–0.78] | 0.76 [0.72–0.83] | 0.80⊗ [0.77–0.83] |
| AMD identification | | | | |
| NEH (n = 553) | 0.94 [0.88–0.99] | 0.88 [0.85–0.93] | 0.92 [0.88–0.95] | 0.97⊗ [0.94–0.98] |
| A2A SD-OCT (n = 384) | 0.99 [0.99–1.00] | 0.98 [0.95–1.00] | 0.98 [0.93–1.00] | 0.99 [0.99–1.00] |
Each entry corresponds to the median AUROC [minimum–maximum] computed over the five test folds, except for GAMMA, for which performance is reported on the dataset's default train-test split. ⊗ indicates that the AUROC improvement is statistically significant (pairwise DeLong test with Holm-Bonferroni correction, p < 0.05). Bold highlights the best performing model for each task.
Video vs. image foundation models
The V-JEPA video foundation model outperformed all single-image foundation models on all datasets except one (Fig. 2a). On A2A SD-OCT16,17, V-JEPA and DINOv2 performed similarly, both achieving near-perfect results. The largest performance gap was observed on the CirrusOCT dataset18, where V-JEPA achieved an AUROC of 0.94 (min-max: 0.92-0.96), compared to 0.88 (min-max: 0.85-0.92) for the second-best model, VisionFM, a statistically significant difference (p < 0.001). On two of the five datasets, namely A2A SD-OCT and GAMMA19, the difference in AUROC between V-JEPA and VisionFM was not statistically significant at the 0.05 level. On HYMC, the dataset with the lowest performance across the four models, V-JEPA still significantly outperformed all single-slice models (Fig. 2 and Table 2), achieving an average AUROC of 0.80 versus 0.76 for VisionFM, the second-best model.
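The Holm-Bonferroni correction applied to the pairwise DeLong p-values can be illustrated with a minimal sketch. The function name `holm_bonferroni` and the example p-values are hypothetical and chosen only for illustration; they are not values from this study, and the DeLong test itself is not implemented here.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Made-up raw p-values for three hypothetical comparisons against one model.
raw = [0.01, 0.04, 0.03]
adjusted = holm_bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

With these made-up inputs, only the first comparison remains significant after correction, which shows how the step-down procedure is stricter than testing each raw p-value against 0.05.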
Supplementary Table 2 also shows that V-JEPA clearly outperforms single-slice foundation models applied to multiple slices, where each slice is processed independently and the predictions are averaged. The gap is even more pronounced for RETFound and VisionFM, as they are pretrained on the middle slice.
Importance of training set size
Performance as a function of training set size was evaluated on CirrusOCT, the largest OCT dataset (Fig. 2b). V-JEPA consistently outperforms image foundation models at all data scales. RETFound performs considerably worse across the entire range. In low-data settings, DINOv2 underperforms compared to VisionFM and V-JEPA but catches up to VisionFM as data increases. While VisionFM and DINOv2 saturate, with no statistically significant difference at the 0.05 level once the training set exceeds 50% of the CirrusOCT dataset (≈450 scans), V-JEPA keeps improving as the training set grows (Fig. 2b). Further work involving a larger dataset would be needed to confirm that this observation holds for larger sample sizes.
Video foundation model versus other volumetric modeling
The results of the different volumetric modeling methods are shown in Fig. 5a and in Table 3. The fine-tuned V-JEPA model significantly outperformed the 3D CNN on all datasets. Overall, V-JEPA outperforms the DINOv2 2.5D strategy on 4 of the 5 datasets, the only exception being HYMC. Furthermore, Fig. 5c shows that the computational cost of the 2.5D DINOv2 is almost triple that of V-JEPA in terms of GFLOPs and latency.
Fig. 5. Analysis of performance and computational cost for different volumetric methods.
a Benchmarking V-JEPA against a 3D CNN trained from scratch and DINOv2 adapted to a 3D setting. b Influence of input slice count on V-JEPA performance. c Computational footprint of the five models: tokens, GFLOPs, latency, and peak GPU memory for a single forward pass.
Table 3.
Results of the three different volumetric modeling methods
| Task / dataset, AUROC [min–max over 5 folds] | 3D CNN | DINOv2 2.5D | V-JEPA |
|---|---|---|---|
| GON identification | | | |
| CirrusOCT (n = 1110) | 0.85 [0.81–0.93] | 0.93 [0.92–0.95] | 0.94 [0.92–0.96] |
| GAMMA (n = 300) | 0.88 | 0.98 | 0.99 |
| HYMC (n = 798) | 0.57 [0.42–0.72] | 0.82 [0.74–0.87] | 0.80 [0.77–0.83] |
| AMD identification | | | |
| NEH (n = 553) | 0.50 [0.30–0.62] | 0.94 [0.89–0.98] | 0.97 [0.94–0.98] |
| A2A SD-OCT (n = 384) | 0.85 [0.80–0.93] | 0.99 [0.98–1.00] | 0.99 [0.99–1.00] |
Each entry corresponds to the median AUROC [minimum–maximum] computed over the five test folds, except for GAMMA, for which performance is reported on the dataset's default train-test split. Bold highlights the best performing model for each task.
Discussion
Recent retinal foundation models like RETFound9, EyeFound11, and VisionFM10 represent a significant milestone in leveraging large-scale datasets and self-supervised pretraining methods. However, their reliance on single-slice OCT analysis remains a critical limitation, as it fails to leverage the full volumetric richness of OCT data, which inherently contains spatial and structural information across multiple slices.
RETFound9 was the first retinal foundation model, trained on 736,442 OCT middle slices using a masked autoencoder (MAE) approach, and later fine-tuned on labeled datasets for diagnostic and prognostic tasks. EyeFound11 extended this to a multimodal setting, training on 2.78 million images from 11 modalities, while still relying solely on the middle OCT slice. It was unavailable during our experiments and thus not benchmarked. VisionFM10 built on prior approaches by adopting the iBOT framework20, training on 3.4 million images, including 1.4 million OCT B-scans across eight modalities. Following the trajectory established by VisionFM, recent retinal foundation model development has increasingly shifted toward multimodal pretraining. A recent example is EyeFM21, which was trained on 14.5 million ocular images from five imaging modalities paired with clinical text. Both VisionFM and EyeFM demonstrate that multimodal learning can yield richer representations with improved generalization across tasks and imaging devices. Other multimodal works have shown promising results22–24. These retinal foundation models have significantly improved downstream task performance but remain limited by their reliance on single-slice OCT data, overlooking the richer context of full-volume scans.
Kermany et al.14 used individual B-scans processed with a 2D CNN pretrained on ImageNet. Although their approach was two-dimensional, it clearly demonstrated the value of leveraging models pretrained on large-scale image datasets and adapting them to OCT analysis. De Fauw et al.25 proposed a different pipeline consisting of one network dedicated to tissue segmentation and a second network for classification based on the volumetric segmentation maps. While powerful, this approach is computationally demanding and relies on extensive pixel-level annotations, in contrast to methods that directly process the raw OCT volume. Maetschke et al.18 conducted seminal studies exploring the use of 3D CNNs trained from scratch for glaucoma detection. However, their experiments were limited to a single dataset. Ran et al.26 further investigated the value of a 3D CNN approach and included validation on a local test set as well as external test sets. Their 3D CNN model was, however, trained from scratch.
Our study validates the hypothesis that leveraging a video foundation model, as opposed to a single-slice foundation model, yields significant improvements on downstream tasks. Specifically, our results across five independent datasets show an average AUROC improvement of 4% and up to 6.5% when using the video foundation model V-JEPA compared to the best-performing image-based foundation model. In the two datasets where AUC improvements were not statistically significant, the lack of difference can be attributed to near-ceiling performance (e.g., 0.99 for V-JEPA vs. 0.98 for VisionFM), suggesting limited room for further improvement. Model performance was consistently lower on the HYMC dataset compared to the others. Unlike the rest, HYMC’s control group includes individuals who were initially glaucoma suspects but later cleared of the diagnosis, rather than healthy individuals. This distinction likely contributed to the reduced performance, as differentiating between suspects and true glaucoma cases is inherently more challenging than distinguishing between glaucoma and unaffected individuals. These empirical findings suggest that relying on single-image foundation models results in the loss of clinically relevant information. First, the video model effectively increases spatial resolution, ensuring that sparse retinal abnormalities, such as drusen, are not overlooked. Second, it captures inter-slice interactions and spatial continuity, features inaccessible to single-slice models, thereby enabling the model to learn and infer from the full 3D structure of the retina.
To visualize this, we extracted attention maps from the last transformer block of V-JEPA and DINOv2 on an AMD case from the NEH dataset. For V-JEPA, we display 8 of the 16 input slices (Fig. 3a), showing how the model attends to anatomical features distributed across frames, many of which are not visible in the middle slice alone. The comparison with DINOv2 attention (Fig. 3b) highlights how V-JEPA leverages contextual information from surrounding slices to form a deeper understanding of anatomical structures, compared to the more limited view offered by DINOv2.
Fig. 3. Explainability.
a Attention maps from V-JEPA across multiple B-scans of an AMD OCT, illustrating how the model dynamically tracks relevant anatomical features over the full volume of the retina. b Comparison of attention of V-JEPA and DINOv2 on the same middle slice (Slice 8) shows that V-JEPA focuses more precisely on clinically relevant structures compared to a single-frame image-based model, highlighting the benefit of spatial context in volumetric interpretation.
The error analysis in Fig. 4 highlights cases where the single-slice foundation model underperformed. Almost all DINOv2 true positives show visible pathology in the central slice, suggesting that the model struggles when drusen or CNV are absent from this region. This is an expected limitation given that it only processes the middle slice. The second group, false negatives for DINOv2 that are true positives for V-JEPA, sheds light on why single-slice models lag behind volumetric ones. In 57% of these cases, pathology is visible in the central slice; in the rest, it appears elsewhere in the volume, typically near the macula. This indicates that the volumetric model not only detects AMD outside the central slice but also leverages contextual information from the full volume to more accurately interpret pathology when it is present in the center.
Fig. 4. Distance in slices between the middle slice and the closest one showing signs of pathology (Drusen or CNV).

Shown for two groups: DINOv2 true positives, and the intersection of DINOv2 false negatives and V-JEPA true positives. True positives and false negatives were determined using the threshold maximizing the F1 score.
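Selecting the decision threshold that maximizes the F1 score, as used to define true positives and false negatives in this error analysis, can be sketched as follows. The function name `best_f1_threshold`, the convention that a score at or above the threshold counts as positive, and the toy labels and scores are assumptions made for illustration only.

```python
def best_f1_threshold(labels, scores):
    """Return (threshold, f1) maximizing F1, predicting positive when score >= threshold."""
    best_thr, best_f1 = None, -1.0
    # Every distinct score is a candidate threshold.
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < thr)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Toy example: 1 = disease, 0 = control; scores are made-up model outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
threshold, f1 = best_f1_threshold(toy_labels, toy_scores)
print(threshold, f1)
```

Once the threshold is fixed, each scan can be labeled as a true positive or false negative for each model, which is what the two groups in the figure rely on.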
The results presented in Fig. 5b and in Table 4 show that using 16 slices leads to better performance (for HYMC, CirrusOCT, and NEH) than using 8 or 32 slices, or at least to comparable performance. This is consistent with the fact that V-JEPA was originally pretrained on 16 frames per video.
Table 4.
Results of V-JEPA with varying number of slices as input
| Task / dataset, AUROC [min–max over 5 folds] | 8 frames | 16 frames | 32 frames |
|---|---|---|---|
| GON identification | | | |
| CirrusOCT (n = 1110) | 0.92 [0.88–0.95] | 0.94 [0.92–0.96] | 0.93 [0.89–0.96] |
| GAMMA (n = 300) | 0.98 | 0.99 | 0.99 |
| HYMC (n = 798) | 0.79 [0.70–0.86] | 0.80 [0.77–0.83] | 0.79 [0.76–0.85] |
| AMD identification | | | |
| NEH (n = 553) | 0.93 [0.89–0.97] | 0.97 [0.94–0.98] | 0.93 [0.80–0.99] |
| A2A SD-OCT (n = 384) | 1.00 [0.99–1.00] | 0.99 [0.99–1.00] | 1.00 [0.99–1.00] |
Each entry corresponds to the median AUROC [minimum–maximum] computed over the five test folds, except for GAMMA, for which performance is reported on the dataset's default train-test split. Bold highlights the best performing model for each task.
Supplementary Table 1 reports the sensitivity of each model across all datasets at a fixed specificity of 80%, corresponding to a clinically relevant operating point. At this specificity, V-JEPA achieves the highest sensitivity on all but one dataset, demonstrating that its performance gains translate into meaningful improvements in potential clinical utility.
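Reading sensitivity at a fixed specificity off a set of model scores can be sketched as below. This is a minimal illustration under stated assumptions: the function name `sensitivity_at_specificity`, the rule that a scan is predicted positive when its score exceeds the threshold, and the toy data are all hypothetical and do not reproduce the procedure or values of this study.

```python
def sensitivity_at_specificity(labels, scores, target_specificity=0.80):
    """Best sensitivity over thresholds whose specificity is at least the target."""
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    positives = [s for y, s in zip(labels, scores) if y == 1]
    if not negatives or not positives:
        raise ValueError("both classes must be present")
    best = 0.0
    # A scan is predicted positive when its score is strictly above the threshold.
    for thr in sorted(set(scores)):
        specificity = sum(s <= thr for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s > thr for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best


# Toy example with five controls (0) and five disease cases (1).
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.9, 0.35, 0.5, 0.6, 0.7, 0.8]
sens = sensitivity_at_specificity(toy_labels, toy_scores)
print(sens)
```

Fixing the specificity makes models comparable at the same false-positive rate, which AUROC alone does not guarantee.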
These findings highlight the potential of applying advances in deep learning for video modeling to the processing of volumetric OCT scans. This study demonstrates that methods developed for video analysis can be effectively repurposed to handle the unique challenges of volumetric OCT scans, unlocking new opportunities for improving diagnostic models.
Our study also validates the hypothesis that a native 3D foundation model is the best option for modeling OCT scans as a 3D volume. Indeed, the fine-tuned V-JEPA outperforms the DINOv2 2.5D strategy on 4 of the 5 datasets. The only exception is the HYMC dataset, where the 2.5D approach achieves higher average performance, though with much greater variance. This outcome is consistent with Bardes et al.12, where the 2.5D method was also found to underperform relative to V-JEPA. Notably, DINOv2-2.5D still represents a substantial improvement over the single-slice DINOv2 baseline, reinforcing our claim that modeling OCT scans based solely on the middle slice has important limitations. Nevertheless, pretraining on volumes appears to allow the model to learn inter-slice interactions, which improves performance on downstream tasks.
The training set size analysis suggests that V-JEPA demonstrates strong generalization even with a limited number of training examples, while its performance continues to improve as the training set size increases. In contrast, DINOv2 and VisionFM reach a performance plateau with just 50% of the training data. VisionFM nonetheless maintains higher performance than RETFound when fine-tuned on smaller datasets, likely because it was pretrained on an OCT dataset twice as large as RETFound's, which may provide more robust and transferable representations. These findings indicate that larger OCT datasets may lead to improved performance for V-JEPA. We acknowledge that model performance cannot be attributed solely to sample size. Other factors, such as disease severity, also play an important role. Supplementary Fig. 1 displays the AUROC stratified by disease severity (Drusen versus Choroidal Neovascularization) relative to “Normal” cases for V-JEPA, DINOv2, RETFound, and VisionFM. As expected, performance is consistently higher for Choroidal Neovascularization, reflecting its status as a more advanced manifestation of the disease.
Our findings are consistent with how retinal experts navigate the full OCT scan during examination, where they rarely rely on a single slice; this study aligns model design with real-world clinical workflows by emphasizing the value of volumetric analysis. Moreover, it highlights that pathologies such as AMD and GON often exhibit spatial continuity, with disease patterns that may span across distant slices. Capturing such patterns requires volume-aware modeling, which 2D approaches are inherently unequipped to handle. The findings of this research may prompt a paradigm shift in the development of foundation models for OCT analysis, moving from a 2D to a 3D approach. Although some recent retinal foundation models have attempted to create a single model applicable to all types of retinal scans, or even to broader ophthalmic imaging data, this unified approach, though appealing, is inherently limiting. In contrast, our results advocate for the development of OCT-specific foundation models that explicitly model the retina in three dimensions. Furthermore, these findings may influence data collection and sharing practices. Open-source volumetric OCT datasets remain scarce, with most publicly available datasets limited to single-slice images.
This study has several limitations. First, although this research includes five independent datasets, further evaluation on additional tasks and datasets from diverse medical centers worldwide is necessary to ensure broader representation across ethnicities, comorbidities, clinical practices, imaging devices, and fields of view. Additional datasets covering a wider range of acquisition devices and protocols, as well as demographics and population samples, are also crucial to enabling a more comprehensive quantitative analysis of their impact. This would also allow us to test for out-of-distribution generalization, which is crucial to support the clinical use of the model. Second, we fine-tuned a general-domain video foundation model. It is likely that even better performance could be achieved by developing a domain-specific foundation model explicitly designed to capture spatiotemporal patterns in volumetric OCT data, pretrained on large-scale in-domain scans. In addition, since we relied on a pretrained video model, we were constrained to using 16 slices, the number of frames it was originally trained on. A foundation model pretrained directly on volumetric OCT scans could allow for a more flexible and task-adapted choice of slice count, potentially improving performance and efficiency, and subsequent work will be needed to define the optimal slice configuration for peak performance. This underscores the need for a paradigm shift in how OCT volumes are leveraged in retinal foundation models, moving beyond single-slice representations toward fully volumetric pretraining strategies. Although the single-slice and volume foundation models have a comparable number of parameters, as their encoder, the core component and main bottleneck, adopts a ViT-L architecture, we observed a fivefold increase in latency for the volume foundation model (0.012 s vs. 0.06 s on a single A100 GPU). This overhead arises from the need to pre-process a larger number of slices. 
While optimizing this step should be the focus of future work, the absolute runtimes remain negligible, making the approach viable for deployment.
Our findings highlight the potential of repurposing general-domain video foundation models for volumetric OCT analysis and significantly improving diagnostic performance. Further gains may be achievable through the development of domain-specific models pretrained on OCT data. Together, these insights advocate for a shift in both model design and data-sharing practices, toward embracing the full richness of volumetric retinal imaging.
Integrating this paradigm shift within a multimodal framework may constitute a meaningful step toward the long-term objective of medical foundation models, namely, unified systems capable of learning from diverse multimodal clinical data. Translating such systems into meaningful clinical tools, however, will require an evolution in validation methodology. As illustrated by initiatives such as EyeFM21, this involves moving beyond benchmark performance toward real-world evaluation, including workflow integration and systematic assessment of safety, robustness, and clinical impact.
Methods
Study design
We benchmark a state-of-the-art SSL-pretrained video model, V-JEPA12, pretrained on general-domain videos, against single-image foundation models for OCT analysis. The latter include both in-domain pretrained foundation models9,10 and DINOv213, a state-of-the-art foundation model trained on natural images.
We also benchmark V-JEPA against two other methods of modeling 3D OCT scans. The first is a 3D CNN trained from scratch following the configuration introduced in Maetschke et al.18. The second leverages a 2D foundation model following the methodology developed in the original work of Bardes et al.12: each slice of the volume is processed independently with DINOv2 to extract a feature map, and these feature maps are then concatenated and fed into an attentive probe.
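The 2.5D baseline described above can be sketched in a few lines. This is only an illustration of the general idea and not the implementation used in this study: random arrays stand in for the per-slice DINOv2 feature maps, concatenation along the token axis is an assumption, the attentive probe is reduced to a single-head cross-attention with one query followed by a linear classifier, and all names, dimensions, and weights (`attentive_pool`, `tokens_per_slice`, `dim`, `w_key`, `w_value`, `w_out`) are hypothetical.

```python
import numpy as np


def attentive_pool(tokens, query, w_key, w_value):
    """Single-head cross-attention: one query attends over all slice tokens."""
    keys = tokens @ w_key                      # (n_tokens, dim)
    values = tokens @ w_value                  # (n_tokens, dim)
    logits = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                    # (dim,)


rng = np.random.default_rng(0)
n_slices, tokens_per_slice, dim = 16, 10, 32   # illustrative sizes only

# Stand-in for the feature map a frozen 2D encoder would return for each slice.
slice_features = [rng.normal(size=(tokens_per_slice, dim)) for _ in range(n_slices)]

# Concatenate the per-slice tokens into one sequence for the probe.
tokens = np.concatenate(slice_features, axis=0)

# Randomly initialized probe parameters (these would be learned during training).
query = rng.normal(size=dim)
w_key = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_value = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_out = rng.normal(size=dim) / np.sqrt(dim)

pooled = attentive_pool(tokens, query, w_key, w_value)
probability = 1.0 / (1.0 + np.exp(-(pooled @ w_out)))
print(tokens.shape, pooled.shape)  # (160, 32) (32,)
```

In this sketch the encoder never sees more than one slice at a time, so any interaction between slices has to be learned by the probe alone.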
Datasets
In total, five independent datasets from four geographical regions are used for the experiments (Table 1). Four datasets are open access, while one new dataset (denoted HYMC) was curated as part of this research.
CirrusOCT18: This dataset consists of OCT scans centered on the optic nerve head, acquired from 624 patients totaling 1,110 scans. Each scan has a physical size of 6 × 6 × 2 mm and 64 × 128 × 64 voxels per volume. Of these, 263 scans were diagnosed as healthy, while 847 were classified as primary open-angle GON.
GAMMA19: This dataset originates from the GAMMA challenge. It consists of 300 fundus-OCT pairs from 276 Chinese patients, categorized as non-GON (151 individuals randomly selected from a myopia cohort), early GON (77 individuals), and mid-advanced GON (72 individuals). The scans are macula-centered, with a 3 × 3 mm en-face field of view and have a shape of 256 × 992 × 512.
HYMC: This dataset was specifically created for this study by the Glaucoma Unit of the Hillel Yaffe Ophthalmology Department in Hadera, Israel (Helsinki approval number 0029-24-HYMC, co-authors EB and MM). Informed consent from patients was exempted by the institutional review board, as only de-identified retrospective data were used. All patients underwent a comprehensive ophthalmic examination as part of standard clinical care for glaucoma management. The dataset includes 798 OCT scans from 210 patients. Among them, 109 patients were diagnosed with GON. The remaining patients were glaucoma suspects with optic discs appearing suspicious for glaucoma. However, GON diagnosis was ruled out after six to twelve months of follow-up by an expert glaucoma specialist. Each 3D OCT scan has dimensions of 256 × 992 × 512.
A2A SD-OCT16,17: This prospective observational study includes SD-OCT scans from elderly subjects; 115 with no ocular diseases and 269 exhibiting bilateral large Drusen (≥125 μm) or large Drusen in one eligible eye and advanced AMD in the second eye. Volumetric scans (6.7 × 6.7 × 6.7 mm corresponding to 100 × 512 × 1000 voxels) were acquired at four clinical sites in the US.
NEH27: This dataset consists of 553 OCT scans from 441 patients, labeled by a retinal specialist as Normal (120), Drusen (160), or Choroidal neovascularization (161). The data was acquired at Noor Eye Hospital, Tehran, Iran. The scans have a shape of 31 × 496 × 768.
From volume to video: OCT scan as a sequence of slices
A volumetric OCT scan is a sequence of slices, similar to a video, which consists of a sequence of frames. To leverage this analogy, we utilize a state-of-the-art video foundation model pretrained on natural videos called V-JEPA12. Most video foundation models are trained using the masked autoencoder (MAE) approach applied to spatiotemporal data28. This approach works by randomly masking video patches and training an encoder-decoder to reconstruct the raw pixel values of the masked regions based on the unmasked patches. Bardes et al.12 propose learning predictive representations in latent space by predicting the latent embeddings of masked spatiotemporal patches from visible context, rather than reconstructing pixels. This promotes abstraction, efficiency, and improved transfer.
V-JEPA processes videos by first temporally downsampling them to 16 frames using a fixed stride and spatially resizing and cropping each frame to 224 × 224. This results in an input volume of 16 × 224 × 224, from which spacetime patches of size 2 × 16 × 16 are extracted. For our experiments, we follow the same setup as V-JEPA, i.e., we reduce the OCT slices to 224 × 224 using cropping. Because we work with volumetric OCT scans, the “temporal” dimension corresponds to spatial depth: each frame represents a different cross-sectional slice of the retina. As a result, while the input format mirrors that of standard video models, the 16-frame sequence captures anatomical variation across depth rather than motion over time.
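As a rough illustration of the arithmetic implied by these dimensions, the sketch below counts the non-overlapping spacetime patches obtained from a 16 × 224 × 224 input with 2 × 16 × 16 patches. The function name and its defaults are hypothetical and are not taken from the V-JEPA code base; the sketch only assumes that the input is tiled without overlap or padding.

```python
def count_spacetime_patches(frames=16, height=224, width=224, patch=(2, 16, 16)):
    """Number of non-overlapping patches of size `patch` tiling the input.

    Hypothetical helper for illustration; assumes every input dimension is
    an exact multiple of the corresponding patch dimension.
    """
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("input dimensions must be multiples of the patch size")
    return (frames // pt) * (height // ph) * (width // pw)


if __name__ == "__main__":
    # 16 slices -> 8 temporal positions x 14 x 14 spatial positions
    print(count_spacetime_patches())
```

Under these assumptions, a 16-slice input yields 8 × 14 × 14 = 1568 patches, and the count scales linearly with the number of slices.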
To assess the value of modeling the full volume, we benchmark V-JEPA against two retinal foundation models9,10 and a state-of-the-art foundation model pretrained on natural images13. For these baselines, we use only the central B-scan of the OCT volume, reflecting the standard practice in 2D-based approaches9,10,14,15. All of these models were finetuned on the data and downstream tasks reported in Table 1.
Implementation
We conducted all experiments on a pair of A100 GPUs with 80 GB of memory. Each finetuning run on one dataset lasted several hours. While we generally adopted the hyperparameters reported in the original publications introducing each model, we opted for a smaller learning rate to ensure convergence, given the small size of our datasets.
For data augmentation during the finetuning of the image foundation models, we followed standard practice for image-based models and adopted the RandAugment strategy29, including random cropping. At inference, we applied only center cropping and bicubic interpolation to reach the required dimensions.
Finetuning V-JEPA presents additional challenges for data augmentation because it uses volumetric data. To maintain spatial coherence, we applied identical 2D augmentations across all slices in a volume30. For augmentation in the third dimension, we introduced jitter by randomly sampling slices. We drew inspiration from Feichtenhofer et al.31, who split each video into 16 consecutive segments of equal length and sample one frame from each segment. OCT data exhibit even greater redundancy than videos because anatomy changes gradually across adjacent B-scans. We therefore divided each OCT stack into four equal-sized consecutive segments and randomly selected four slices from each segment at each training iteration to construct the 16-slice input to the model. At inference, we uniformly sampled 16 slices from the volume and center-cropped them to 224 × 224, consistent with the preprocessing used for the image-based foundation models. For both training and inference, we ensured that the middle B-scan was among the 16 sampled slices, so that the comparison with the image-based models is fair and reliable.
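The slice-sampling scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the way the middle B-scan is enforced (replacing the sampled index closest to the center of the stack) is an assumption, since the text does not specify the mechanism.

```python
import random


def with_middle(indices, num_slices):
    """Ensure the central slice is present (assumed mechanism: replace the
    sampled index nearest to the middle, which keeps the list sorted)."""
    middle = num_slices // 2
    if middle in indices:
        return indices
    nearest = min(range(len(indices)), key=lambda k: abs(indices[k] - middle))
    return indices[:nearest] + [middle] + indices[nearest + 1:]


def sample_training_slices(num_slices, num_segments=4, per_segment=4, rng=None):
    """Randomly draw `per_segment` slices from each of `num_segments`
    consecutive, near-equal segments of the stack (training-time jitter)."""
    rng = rng or random.Random()
    bounds = [i * num_slices // num_segments for i in range(num_segments + 1)]
    indices = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        indices.extend(rng.sample(range(start, stop), per_segment))
    return with_middle(sorted(indices), num_slices)


def sample_inference_slices(num_slices, num_samples=16):
    """Deterministic, evenly spaced slices (centers of equal-width bins)."""
    indices = [(2 * i + 1) * num_slices // (2 * num_samples)
               for i in range(num_samples)]
    return with_middle(indices, num_slices)


if __name__ == "__main__":
    print(sample_training_slices(100, rng=random.Random(0)))
    print(sample_inference_slices(100))
```

Note that forcing the middle slice in this way can shift one sampled index across a segment boundary, so the four-per-segment property holds only approximately in this sketch.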
Statistical analysis
Our evaluation strategy followed the methodology commonly adopted in prior retinal foundation model studies9–11, where each dataset is considered independently. Similarly, our research focuses on the common scenario in which OCT data from a single medical center and its associated population sample are available, but only in limited quantity. Our evaluation, therefore, quantifies the performance of the foundation model after fine-tuning on such a small, locally available training set. Experiments consisted of all models being separately finetuned end-to-end on five datasets for the tasks of AMD and GON detection. The model's performance on binary classification is evaluated using the area under the receiver operating characteristic curve (AUROC). To obtain a precise estimate of model generalization performance, we use a nested cross-validation strategy, dividing the data into five train/test folds. Each training fold is further split into training and validation subsets, the latter used to determine the optimal number of training epochs. To prevent data leakage, all data splitting was performed at the patient level whenever patient identifiers were available. The model is then trained on the combined training and validation data for the determined number of epochs. Inference is then performed on the test fold. AUROC is reported for each test fold, with the mean reported as a measure of central tendency, and the minimum and maximum AUROC values across the five nested folds reported as a measure of dispersion. For the GAMMA dataset, patient identifiers were unavailable, preventing us from grouping scans by individuals. To avoid potential data leakage, we experimented solely on the original train/validation/test split provided by the original authors.
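To make the evaluation protocol concrete, the sketch below shows one way to build patient-level folds and to summarize per-fold AUROC by its mean, minimum, and maximum. It is an illustration under stated assumptions, not the code used in the study: the helper names are hypothetical, patients are assigned to folds at random without class stratification, and AUROC is computed with the pairwise (Mann-Whitney) formulation.

```python
import random
from statistics import mean


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) scan indices such that all scans of a
    given patient fall on the same side of the split."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    for k in range(n_folds):
        train = [i for i, p in enumerate(patient_ids) if fold_of[p] != k]
        test = [i for i, p in enumerate(patient_ids) if fold_of[p] == k]
        yield train, test


def auroc(labels, scores):
    """Probability that a positive case scores higher than a negative one
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC requires both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(fold_aurocs):
    """Mean as central tendency, min and max as dispersion."""
    return mean(fold_aurocs), min(fold_aurocs), max(fold_aurocs)


if __name__ == "__main__":
    ids = ["a", "a", "b", "c", "c", "d", "e", "f", "g", "h"]
    for train, test in patient_level_folds(ids):
        print(len(train), len(test))
```

The inner train/validation split used to choose the number of epochs could reuse the same function on the patient identifiers of each training fold.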
Pairwise DeLong tests with Holm–Bonferroni correction were applied to the aggregated predictions of the five folds to test whether the difference in performance between V-JEPA and the second-best performing model was statistically significant.
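For readers unfamiliar with the Holm–Bonferroni step-down procedure, a minimal sketch of the p-value adjustment is given below. It assumes the raw p-values have already been obtained from the pairwise DeLong tests (not shown here); the function name is hypothetical.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and made non-decreasing in rank order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    print(holm_bonferroni([0.01, 0.04, 0.03]))
```

A comparison is then declared significant when its adjusted p-value falls below the chosen significance level.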
Effect of training set size
To assess the impact of training set size on model performance, we generated performance curves that report test-set performance as a function of the amount of available training data. For this experiment, we selected the dataset with the largest number of OCT scans, i.e., CirrusOCT. Nested cross-validation was applied as previously described. Within each nested fold, models were trained on varying proportions of the training set (12.5%, 25%, 50%, and 100%), while the test set remained fixed. This procedure was repeated for each of the foundation models.
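One simple way to build such training subsets is sketched below. It assumes that the subsets are nested (each smaller subset is contained in the larger ones) and drawn by a single random shuffle; the text does not state whether this was the case, and the function name is hypothetical.

```python
import random


def nested_training_subsets(train_indices, fractions=(0.125, 0.25, 0.5, 1.0), seed=0):
    """Map each fraction to a prefix of one shuffled copy of the training set,
    so that smaller subsets are contained in larger ones (an assumption)."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[:max(1, round(f * len(shuffled)))] for f in fractions}


if __name__ == "__main__":
    subsets = nested_training_subsets(range(80))
    print({f: len(s) for f, s in subsets.items()})
```

When a patient contributes several scans, the same idea could be applied to patient identifiers instead of scan indices to keep the subsampling at the patient level.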
Influence of the number of input slices
To quantify how volumetric context affects performance, we evaluated V-JEPA with varying numbers of input slices (8, 16, and 32). Thanks to the model design, and in particular its ViT-based architecture, the number of slices can be modified, much like changing the input resolution, by interpolating the positional embeddings without having to re-pretrain the model. The frame sampling method was kept consistent throughout the three experiments.
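The idea of resizing positional embeddings along the slice axis can be illustrated with a small, framework-free sketch. It linearly interpolates a table of per-position embedding vectors to a new number of temporal positions. This is only an assumption about the general technique: the function name is hypothetical, the endpoint-aligned coordinate mapping is one of several possible conventions, and the actual V-JEPA code may handle its positional embeddings differently. With patches spanning 2 slices, inputs of 8, 16, and 32 slices would correspond to 4, 8, and 16 temporal positions.

```python
def interpolate_temporal_embeddings(table, new_len):
    """Linearly resample a list of embedding vectors (one per temporal
    position) to `new_len` positions, keeping the first and last aligned."""
    old_len = len(table)
    if new_len == old_len:
        return [row[:] for row in table]
    if new_len == 1 or old_len == 1:
        return [table[0][:] for _ in range(new_len)]
    scale = (old_len - 1) / (new_len - 1)
    resampled = []
    for j in range(new_len):
        x = j * scale                    # fractional position in the old table
        lo = min(int(x), old_len - 2)    # index of the left neighbour
        w = x - lo                       # weight of the right neighbour
        resampled.append([(1 - w) * a + w * b
                          for a, b in zip(table[lo], table[lo + 1])])
    return resampled


if __name__ == "__main__":
    toy = [[0.0, 0.0], [1.0, 10.0], [2.0, 20.0]]
    print(interpolate_temporal_embeddings(toy, 5))
```

In a deep learning framework, the same operation would typically be applied to the full embedding tensor with the framework's own interpolation routine.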
We also profiled compute for each configuration, reporting GFLOPs, latency, and peak GPU memory for a single forward pass. These metrics are computed for the model processing only, excluding data loading and preprocessing, as they are dataset-dependent.
Acknowledgements
We acknowledge the assistance of ChatGPT, an AI-based language model developed by OpenAI, in editing the manuscript. R.J. and J.B. acknowledge the support of the Zimin Foundation.
Author contributions
Conceptualization: J.B. and R.J. Methodology: J.B., R.J., and T.M. Data curation: E.B. and M.M. Investigation: J.B. and R.J. Funding acquisition: J.B. Writing—original draft: J.B. and R.J. Writing—review & editing: J.B., R.J., E.B., M.M., and T.M.
Data availability
We used the following open-access datasets: CirrusOCT (https://zenodo.org/records/1481223), GAMMA (gamma.grand-challenge.org), A2A SD-OCT (people.duke.edu/~sf59/RPEDC_Ophth_2013_dataset.htm), and NEH-UT (data.mendeley.com/datasets/8kt969dhx6/2). The HYMC dataset may be made available for non-commercial academic use by the authors, with permission from the Hillel Yaffe Medical Center. Please contact the corresponding author regarding such requests.
Code availability
The source code used in this study is available in the following GitHub repository, which will be made public upon publication: https://github.com/aim-lab/oct-fm-slices-to-volumes. Instructions for running the code are provided in the README.md file.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41746-026-02496-7.
References
- 1. World Health Organization. World report on vision. https://www.who.int/publications/i/item/world-report-on-vision (2019).
- 2. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158–164 (2018).
- 3. Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103, 167–175 (2019).
- 4. Cheung, C. Y. et al. A deep learning model for detection of Alzheimer’s disease based on retinal photographs: a retrospective, multicentre case-control study. Lancet Digit. Health 4, E806–E815 (2022).
- 5. Srivastava, O., Tennant, M., Grewal, P., Rubin, U. & Seamone, M. Artificial intelligence and machine learning in ophthalmology: a review. Indian J. Ophthalmol. 71, 11–17 (2023).
- 6. Men, Y. et al. Drstagenet: deep learning for diabetic retinopathy staging from fundus images. Physiol. Meas. 46, 015001 (2025).
- 7. Abramovich, O. et al. Gonet: a generalizable deep learning model for glaucoma detection. IEEE Trans. Biomed. Eng. 73, 32–39 (2025).
- 8. Krishnan, R., Rajpurkar, P. & Topol, E. J. Self-supervised learning in medicine and healthcare. Nat. Biomed. Eng. 6, 1346–1352 (2022).
- 9. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
- 10. Qiu, J. et al. Development and validation of a multimodal multitask vision foundation model for generalist ophthalmic artificial intelligence. NEJM AI 1, AIoa2300221 (2024).
- 11. Shi, D. et al. Eyefound: a multimodal generalist foundation model for ophthalmic imaging. In Proc. Poster session presented at The International Conference of Vision and Eye Research (iCover) (ARVO, 2024).
- 12. Bardes, A. et al. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024).
- 13. Oquab, M. et al. DINOv2: learning robust visual features without supervision. Trans. Mach. Learn. Res. (2024).
- 14. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9 (2018).
- 15. Wen, H. et al. Towards more efficient ophthalmic disease classification and lesion location via convolution transformer. Comput. Methods Prog. Biomed. 220, 106832 (2022).
- 16. Leuschen, J. N. et al. Spectral-domain optical coherence tomography characteristics of intermediate age-related macular degeneration. Ophthalmology 120, 140–150 (2013).
- 17. Farsiu, S. et al. Quantitative classification of eyes with and without intermediate age-related macular degeneration using optical coherence tomography. Ophthalmology 121, 162–172 (2014).
- 18. Maetschke, S. et al. A feature agnostic approach for glaucoma detection in OCT volumes. PLoS ONE 14, e0219126 (2019).
- 19. Wu, J. et al. Gamma challenge: glaucoma grading from multi-modality images. Med. Image Anal. 90, 102938 (2023).
- 20. Zhou, J. et al. iBOT: image BERT pre-training with online tokenizer. In Proc. 10th International Conference on Learning Representations (ICLR, 2022).
- 21. Wua, Y. et al. An eyecare foundation model for clinical assistance: a randomized controlled trial. Nat. Med. 31, 3404–3413 (2025).
- 22. Shi, D. et al. A multimodal visual-language foundation model for computational ophthalmology. npj Digit. Med. 8, 381 (2025).
- 23. Morano, J. et al. Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis. npj Digit. Med. 8, 576 (2025).
- 24. Silva-Rodríguez, J., Chakor, H., Kobbi, R., Dolz, J. & Ayed, I. B. A foundation language-image model of the retina (flair): encoding expert knowledge in text supervision. Med. Image Anal. 99, 103357 (2025).
- 25. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
- 26. Ran, A. R. et al. Three-dimensional multi-task deep learning model to detect glaucomatous optic neuropathy and myopic features from optical coherence tomography scans: a retrospective multi-centre study. Front. Med. 9, 860574 (2022).
- 27. Sotoudeh-Paima, S., Jodeiri, A., Hajizadeh, F. & Soltanian-Zadeh, H. Multi-scale convolutional neural network for automated AMD classification using retinal OCT images. Comput. Biol. Med. 144, 105368 (2022).
- 28. Tong, Z., Song, Y., Wang, J. & Wang, L. Videomae: masked autoencoders are data-efficient learners for self-supervised video pre-training. In Proc. 36th International Conference on Neural Information Processing Systems, 10078–10093 (NIPS, 2022).
- 29. Cubuk, E. D., Zoph, B., Shlens, J. & Le, Q. V. Randaugment: practical automated data augmentation with a reduced search space. In Proc. 34th International Conference on Neural Information Processing Systems, 18613–18624 (NIPS, 2020).
- 30. Arnab, A. et al. Vivit: a video vision transformer. In Proc. IEEE/CVF International Conference on Computer Vision, 6836–6846 (ICCV, 2021).
- 31. Feichtenhofer, C., Fan, H., Li, Y. & He, K. Masked autoencoders as spatiotemporal learners. In Proc. 36th International Conference on Neural Information Processing Systems, Vol. 35, 35946–35958 (NIPS, 2022).