Table 2. Summary of representative 2D multimodal LLMs in radiology trained by contrastive learning.
| Model | Base architecture (vision + LLM) | Key technique(s) | Primary task(s) | Dataset(s) used | Key strength/contribution |
|---|---|---|---|---|---|
| ConVIRT [52] | ResNet50 + ClinicalBERT | Bidirectional image-text contrastive pretraining, large-batch unsupervised learning | Zero-shot classification & retrieval | MIMIC-CXR v2 (227K) + internal musculoskeletal set (48K pairs) | First medical image-text contrastive framework |
| MedCLIP [53] | ViT + BioClinicalBERT | Decoupled contrastive learning, semantic matching loss (using medical knowledge) | Zero-shot classification, supervised classification, image-text retrieval | Unpaired images/text (e.g., CheXpert, MIMIC-CXR) | High data efficiency, addresses false negatives, strong zero-shot performance |
| BioViL-T [54] | Hybrid CNN-transformer multi-image encoder + CXR-BERT | Temporal vision-language pretraining, contrastive learning | Progression classification, phrase grounding, RRG | MIMIC-CXR (longitudinal pairs) | First model with temporal awareness, SOTA on temporal tasks |
| BioMedCLIP [56] | ViT + PubMedBERT | Large-scale contrastive pre-training | Cross-modal retrieval, zero-shot/few-shot/full-shot image classification, VQA | PMC-15M (15 million diverse biomedical image-text pairs) | Domain-specific adaptations, positive transfer learning demonstrated |
LLM = large language model, ConVIRT = contrastive learning of medical visual representations, ResNet = residual network, BERT = bidirectional encoder representations from transformers, MIMIC-CXR = medical information mart for intensive care chest X-ray dataset, CXR = chest X-ray, MedCLIP = medical contrastive language-image pre-training, ViT = vision transformer, BioClinicalBERT = clinical BERT pre-trained on biomedical and clinical notes, CheXpert = chest X-ray expert-labeled dataset from Stanford, BioViL-T = biomedical vision-language model with temporal modeling, CNN = convolutional neural network, CXR-BERT = BERT variant trained on chest X-ray reports, RRG = radiology report generation, SOTA = state-of-the-art, BioMedCLIP = biomedical CLIP-style pre-training using ViT and PubMedBERT, PubMedBERT = BERT pre-trained on PubMed abstracts, VQA = visual question answering, PMC = PubMed Central
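All models in Table 2 share the bidirectional image-text contrastive objective introduced by ConVIRT and popularized by CLIP: matched image-report pairs are pulled together in a shared embedding space while mismatched pairs within the batch are pushed apart. A minimal sketch of this symmetric (image-to-text plus text-to-image) InfoNCE loss, written here in NumPy with an assumed batch of precomputed encoder embeddings and a hypothetical temperature value, is:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each embedding onto the unit hypersphere so that the
    # dot product between rows equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched
    image-report pair. Temperature is an illustrative value, not one
    taken from any of the surveyed models.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = img @ txt.T / temperature

    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    diag = np.arange(logits.shape[0])
    # Image-to-text direction: each image must identify its own report.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image direction: each report must identify its own image.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The variants in the table modify this core recipe rather than replace it: MedCLIP relaxes the assumption that only the diagonal pairs are positives (its semantic matching loss treats reports with the same clinical findings as soft positives), while BioViL-T extends the image side to encode pairs of studies from different time points.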