Table 3.
Overview of vision-language models in biomedical healthcare
Model Name | Type | Image Encoder | Text Encoder | Training Corpora | Release Date |
---|---|---|---|---|---|
ConVIRT | Dual | ResNet | ClinicalBERT | MIMIC-CXR | 2022 |
GLoRIA | Dual | ResNet | BioClinicalBERT | CheXpert [240] | 2021 |
MedCLIP | Dual | ResNet/ViT | BioClinicalBERT | CheXpert, MIMIC-CXR | 2022 |
CheXzero | Dual | CLIP-Image | CLIP-Text | Chest X-rays | 2022 |
LoVT | Dual | ResNet | ClinicalBERT | MIMIC-CXR | 2022 |
Adapted VLMs | Hierarchical | Diffusion, VAE [241] | BERT, CLIP | CheXpert, MIMIC-CXR | 2022 |
VisualBERT | Fusion | Varies | BERT | MIMIC-CXR | 2020 |
MedViLL | Fusion | ResNet | BERT | MIMIC-CXR | 2022 |
ARL | Fusion | CLIP-Image | RoBERTa [242] | MedICaT [243], MIMIC-CXR, ROCO [244] | 2022 |
LViT | Fusion | ViT | BERT | QaTa-COV19 [245], MoNuSeg [246] | 2023 |
RoentGen | Hierarchical | Diffusion | CLIP-Text | MIMIC-CXR | 2022 |
CLIPSyntel | Dual | CLIP | GPT-3.5 | MMQS [247] | 2024 |
Med-UniC | Dual | ResNet/ViT | CXR-BERT [248] | MIMIC-CXR, PadChest [249] | 2024 |
EchoCLIP | Dual | ConvNeXt [250] | CLIP-Text | Echocardiogram videos | 2024 |
LLaVA-Med | Fusion | LLaVA [8] | LLaVA | PubMed [251], PMC-15M [252] | 2024 |