2024 May 3;69(10):10TR01. doi: 10.1088/1361-6560/ad387d

Table 4.

Overview of LMs for captioning medical images. The asterisks (*) indicate terms that are either not present in the original paper or do not apply in this context.

| References | ROI | Modality | Dataset | Model name | Vision model | Language model |
|---|---|---|---|---|---|---|
| Nicolson et al (2023) | * | x-ray, Ultrasound, CT, MRI | Radiology objects in context (ROCO) | CvT2DistilGPT2 | CvT-21 | DistilGPT2 |
| Nicolson et al (2021) | * | CT, Ultrasound, x-ray, Fluoroscopy, PET, Mammography, MRI, Angiography | ROCO, ImageCLEFmed Caption 2021 | * | ViT | PubMedBERT |
| Kim et al (2023b) | Brain | CT | Institutional | * | ResNet-50, EfficientNet-B5, DenseNet-201, and ConvNeXt-S | DistilGPT2 |
| Zheng and Yu (2023) | * | * | IU-Xray | * | CDGPT (Conditioned Densely-connected Graph Transformer) | AlignTrans (Alignment Transformer) |
| Wang and Li (2022) | * | * | ImageCLEFmedical Caption 2022 | BLIP | Vision Transformer (ViT-B) | BERT |
| Ding et al (2023) | * | Histopathology | MIDOG 2022 | CLIP, BLIP | Vision Transformer (ViT-B) | BERT |
| Wang et al (2023a) | * | * | ImageCLEFmedical Caption 2022 | BLIP | Vision Transformer (ViT-B) | BERT |
| Zhou et al (2023) | | Ultrasound, CT, x-ray, MRI | ROCO | BLIP-2 | ViT-g/14 | OPT2.7B |