
Figure 4. Pipeline for training vision-language models. Image and text data are processed independently by separate encoders to generate feature embeddings for the images and the text. The vision-language models are trained to maximise the agreement between paired image and text feature embeddings. The trained encoders can then be applied to both image-based and text-based downstream tasks. OCT, optical coherence tomography; DR, diabetic retinopathy.
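
The training objective described in the caption, maximising agreement between paired image and text embeddings, is commonly implemented as a CLIP-style symmetric contrastive loss. The sketch below illustrates that idea only; the encoder outputs, embedding dimension, temperature value and function names are assumptions for illustration, not the authors' exact method.

```python
# Minimal sketch of a contrastive image-text alignment objective
# (symmetric cross-entropy over cosine similarities). Illustrative only.
import torch
import torch.nn.functional as F


def contrastive_loss(image_embeddings: torch.Tensor,
                     text_embeddings: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalise embeddings so dot products are cosine similarities.
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeddings @ text_embeddings.t() / temperature

    # Matched pairs lie on the diagonal; maximise their agreement relative to
    # all mismatched pairs, in both image-to-text and text-to-image directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Hypothetical batch of 8 image embeddings paired with 8 text embeddings,
    # e.g. OCT scans and their report text, each projected to 512 dimensions.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())
```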
