Table 2. Summary of representative 2D multimodal LLMs in radiology trained by contrastive learning.
| Model | Base architecture (vision + LLM) | Key technique(s) | Primary task(s) | Dataset(s) used | Key strength/contribution |
|---|---|---|---|---|---|
| ConVIRT [52] | ResNet50 + ClinicalBERT | Bidirectional image-text contrastive pretraining, large-batch unsupervised learning | Zero-shot classification & retrieval | MIMIC-CXR v2 (227K) + internal musculoskeletal set (48K pairs) | First medical image-text contrastive framework |
| MedCLIP [53] | ViT + BioClinicalBERT | Decoupled contrastive learning, semantic matching loss (using medical knowledge) | Zero-shot classification, supervised classification, image-text retrieval | Unpaired images/text (e.g., CheXpert, MIMIC-CXR) | High data efficiency, addresses false negatives, strong zero-shot performance |
| BioViL-T [54] | Hybrid CNN-transformer multi-image encoder + CXR-BERT | Temporal vision-language pretraining, contrastive learning | Progression classification, phrase grounding, RRG | MIMIC-CXR (longitudinal pairs) | First model with temporal awareness, SOTA on temporal tasks |
| BioMedCLIP [56] | ViT + PubMedBERT | Large-scale contrastive pre-training | Cross-modal retrieval, zero-shot/few-shot/full-shot image classification, VQA | PMC-15M (15 million diverse biomedical image-text pairs) | Domain-specific adaptations, positive transfer learning demonstrated |
LLM = large language model, ConVIRT = contrastive learning of medical visual representations, ResNet = residual network, BERT = bidirectional encoder representations from transformers, MIMIC-CXR = medical information mart for intensive care chest X-ray dataset, CXR = chest X-ray, MedCLIP = medical contrastive language-image pre-training, ViT = vision transformer, BioClinicalBERT = clinical BERT pre-trained on biomedical and clinical notes, CheXpert = chest X-ray expert-labeled dataset from Stanford, BioViL-T = biomedical vision-language model with temporal modeling, CNN = convolutional neural network, CXR-BERT = BERT variant trained on chest X-ray reports, RRG = radiology report generation, SOTA = state-of-the-art, BioMedCLIP = biomedical CLIP-style pre-training using ViT and PubMedBERT, PubMedBERT = BERT pre-trained on PubMed abstracts, VQA = visual question answering, PMC = PubMed Central
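All models in Table 2 share the bidirectional image-text contrastive objective introduced by ConVIRT and popularized by CLIP: matched image-report pairs are pulled together in a shared embedding space while mismatched pairs within the batch are pushed apart. A minimal sketch of this symmetric (image-to-text plus text-to-image) InfoNCE loss, written here in NumPy with an assumed batch of precomputed encoder embeddings and a hypothetical temperature value, is:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each embedding onto the unit hypersphere so that the
    # dot product between rows equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    image-report pair. Temperature is an illustrative value, not one
    taken from any of the surveyed models.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = img @ txt.T / temperature

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    diag = np.arange(logits.shape[0])
    # Image-to-text direction: each image must identify its own report.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image direction: each report must identify its own image.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The variants in the table modify this core recipe rather than replace it: MedCLIP relaxes the assumption that only the diagonal pairs are positives (its semantic matching loss treats reports with the same clinical findings as soft positives), while BioViL-T extends the image side to encode pairs of studies from different time points.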