2025 Jan 4;18:2. doi: 10.1186/s13040-024-00414-9

Table 3.

Overview of vision-language models in biomedical healthcare

| Model Name | Type | Image Encoder | Text Encoder | Training Corpora | Release Date |
|---|---|---|---|---|---|
| ConVIRT | Dual | ResNet | ClinicalBERT | MIMIC-CXR | 2022 |
| GLoRIA | Dual | ResNet | BioClinicalBERT | CheXpert [240] | 2021 |
| MedCLIP | Dual | ResNet/ViT | BioClinicalBERT | CheXpert, MIMIC-CXR | 2022 |
| CheXZero | Dual | CLIP-Image | CLIP-Text | Chest X-rays | 2022 |
| LoVT | Dual | ResNet | ClinicalBERT | MIMIC-CXR | 2022 |
| Adapted VLMs | Hierarchical | Diffusion, VAE [241] | BERT, CLIP | CheXpert, MIMIC-CXR | 2022 |
| VisualBERT | Fusion | Varies | BERT | MIMIC-CXR | 2020 |
| MedViLL | Fusion | ResNet | BERT | MIMIC-CXR | 2022 |
| ARL | Fusion | CLIP-Image | RoBERTa [242] | MedICaT [243], MIMIC-CXR, ROCO [244] | 2022 |
| LViT | Fusion | ViT | BERT | QaTa-COV19 [245], MoNuSeg [246] | 2023 |
| RoentGen | Hierarchical | Diffusion | CLIP-Text | MIMIC-CXR | 2022 |
| CLIPSyntel | Dual | CLIP | GPT-3.5 | MMQS [247] | 2024 |
| Med-unic | Dual | ResNet/ViT | CXR-BERT [248] | MIMIC-CXR, PadChest [249] | 2024 |
| EchoCLIP | Dual | ConvNeXt [250] | CLIP-Text | Echocardiogram videos | 2024 |
| Llava-med | Fusion | Llava [8] | Llava | PubMed [251], PMC-15M [252] | 2024 |