Korean J Radiol. 2025 Aug 8;26(10):900–923. doi: 10.3348/kjr.2025.0599

Table 2. Summary of representative 2D multimodal LLMs in radiology trained by contrastive learning.

| Model | Base architecture (vision + LLM) | Key technique(s) | Primary task(s) | Dataset(s) used | Key strength/contribution |
|---|---|---|---|---|---|
| ConVIRT [52] | ResNet50 + ClinicalBERT | Bidirectional image-text contrastive pre-training, large-batch unsupervised learning | Zero-shot classification & retrieval | MIMIC-CXR v2 (227K) + internal musculoskeletal set (48K pairs) | First medical image-text contrastive framework |
| MedCLIP [53] | ViT + BioClinicalBERT | Decoupled contrastive learning, semantic matching loss (using medical knowledge) | Zero-shot classification, supervised classification, image-text retrieval | Unpaired images/text (e.g., CheXpert, MIMIC-CXR) | High data efficiency, addresses false negatives, strong zero-shot performance |
| BioViL-T [54] | Hybrid CNN-transformer multi-image encoder + CXR-BERT | Temporal vision-language pre-training, contrastive learning | Progression classification, phrase grounding, RRG | MIMIC-CXR (longitudinal pairs) | First model with temporal awareness, SOTA on temporal tasks |
| BioMedCLIP [56] | ViT + PubMedBERT | Large-scale contrastive pre-training | Cross-modal retrieval, zero-shot/few-shot/full-shot image classification, VQA | PMC-15M (15 million diverse biomedical image-text pairs) | Domain-specific adaptations, positive transfer learning demonstrated |

LLM = large language model, ConVIRT = contrastive learning of medical visual representations, ResNet = residual network, BERT = bidirectional encoder representations from Transformers, MIMIC-CXR = medical information mart for intensive care chest X-ray dataset, CXR = chest X-ray, MedCLIP = medical contrastive language-image pre-training, ViT = vision transformer, BioClinicalBERT = clinical BERT pre-trained on biomedical notes, CheXpert = chest X-ray expert-labeled dataset from Stanford, BioViL-T = biomedical vision-language model with temporal modeling, CNN = convolutional neural network, CXR-BERT = BERT variant trained on chest X-ray reports, RRG = radiology report generation, SOTA = state-of-the-art, BioMedCLIP = biomedical CLIP-style pre-training using ViT and PubMedBERT, PubMedBERT = BERT pre-trained on PubMed abstracts, VQA = visual question answering, PMC = PubMed Central
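
All four models in Table 2 share the same core training signal: a bidirectional image-text contrastive (InfoNCE) objective that pulls paired image and report embeddings together and pushes mismatched pairs apart. The snippet below is a minimal PyTorch sketch of that objective; the embedding dimension, batch size, and temperature are illustrative assumptions, not the published configurations of ConVIRT, MedCLIP, BioViL-T, or BioMedCLIP.

```python
# Minimal sketch of a CLIP/ConVIRT-style bidirectional image-text contrastive
# objective (InfoNCE in both directions). Dimensions and temperature are
# illustrative assumptions, not the exact published setups.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb: torch.Tensor,
                                   text_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) projections of paired images/reports."""
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares image i with text j
    logits = image_emb @ text_emb.t() / temperature

    # Paired samples lie on the diagonal and act as the positive targets
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Image-to-text and text-to-image InfoNCE terms, averaged
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    # Random stand-in embeddings; real models would feed the outputs of a
    # ResNet/ViT image encoder and a clinical BERT text encoder through
    # projection heads.
    img = torch.randn(32, 512)
    txt = torch.randn(32, 512)
    print(bidirectional_contrastive_loss(img, txt))
```

As summarized in the table, MedCLIP's decoupled variant replaces the one-hot diagonal targets above with soft targets derived from shared clinical entities, so unpaired images and reports can still supervise each other and semantically similar off-diagonal pairs are not penalized as false negatives.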