2022 Sep 12;17(9):e0274300. doi: 10.1371/journal.pone.0274300

Table 1. Comparison of our models with other published baselines.

Models	Batch size	Detecting modality	Loss	Optimizer	Learning rate	Acc., dev set (n = 500)*	AUROC, dev set (n = 500)*	Acc., test set (n = 1000)*	AUROC, test set (n = 1000)*
Image-Grid	32	Image	Cross entropy	AdamW	1.00E-05	0.500±0.045 (0.436–0.536)	0.516±0.027 (0.478–0.543)	0.511±0.023 (0.478–0.526)	0.514±0.018 (0.498–0.530)
Image-Region	32	Image	Cross entropy	AdamW	5.00E-05	0.513±0.032 (0.484–0.548)	0.549±0.030 (0.508–0.579)	0.531±0.023 (0.502–0.558)	0.561±0.039 (0.526–0.617)
Text BERT	64	Text	Cross entropy	AdamW	5.00E-05	0.569±0.020 (0.548–0.588)	0.625±0.047 (0.579–0.669)	0.586±0.024 (0.556–0.612)	0.639±0.006 (0.633–0.645)
Late Fusion	32	Image&Text	Cross entropy	AdamW	5.00E-05	0.589±0.031 (0.544–0.612)	0.641±0.040 (0.613–0.700)	0.619±0.011 (0.608–0.630)	0.679±0.018 (0.665–0.705)
ConcatBERT	32	Image&Text	Cross entropy	AdamW	1.00E-05	0.576±0.038 (0.540–0.616)	0.645±0.012 (0.629–0.655)	0.622±0.023 (0.588–0.636)	0.682±0.017 (0.659–0.696)
MMBT-Grid	32	Image&Text	Cross entropy	AdamW	1.00E-05	0.603±0.042 (0.544–0.644)	0.672±0.018 (0.654–0.696)	0.631±0.014 (0.616–0.650)	0.694±0.006 (0.687–0.700)
MMBT-Region	32	Image&Text	Cross entropy	AdamW	5.00E-05	0.605±0.059 (0.524–0.652)	0.649±0.067 (0.585–0.722)	0.642±0.032 (0.608–0.672)	0.690±0.046 (0.646–0.735)
ViLBERT	32	Image&Text	Cross entropy	AdamW	1.00E-05	0.633±0.020 (0.612–0.656)	0.717±0.035 (0.677–0.747)	0.659±0.007 (0.652–0.668)	0.732±0.015 (0.716–0.753)
Visual BERT	32	Image&Text	Cross entropy	AdamW	5.00E-05	0.638±0.023 (0.612–0.668)	0.722±0.010 (0.711–0.732)	0.664±0.013 (0.656–0.684)	0.748±0.011 (0.732–0.757)
ViLBERT CC	32	Image&Text	Cross entropy	AdamW	1.00E-05	0.656±0.009 (0.648–0.668)	0.730±0.035 (0.691–0.773)	0.664±0.009 (0.652–0.674)	0.739±0.016 (0.724–0.757)
Visual BERT COCO	32	Image&Text	Cross entropy	AdamW	5.00E-05	0.648±0.032 (0.608–0.676)	0.732±0.017 (0.711–0.752)	0.664±0.020 (0.646–0.692)	0.737±0.025 (0.711–0.770)
OSCAR+FC	50	Image&Tag&Text	Cross entropy	AdamW	5.00E-06	0.666±0.038 (0.626–0.706)	0.758±0.042 (0.703–0.803)	0.677±0.010 (0.664–0.689)	0.762±0.016 (0.749–0.786)
OSCAR+RF	50	Image&Tag&Text	Cross entropy	AdamW	5.00E-06	0.667±0.034 (0.618–0.698)	0.759±0.014 (0.745–0.777)	0.684±0.002 (0.682–0.686)	0.768±0.021 (0.737–0.784)

Footnotes: Acc., accuracy; AUROC, area under the receiver operating characteristic curve.

*Mean±standard error, with the range in parentheses, calculated from evaluations of four final models.
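As an illustration of how a cell in this format could be produced, the following minimal sketch summarizes four per-model scores as mean±standard error with the range in parentheses. It assumes the standard error is the sample standard deviation (n − 1 denominator) divided by the square root of the number of models; the table does not state which estimator was used. The function names `summarize` and `format_cell` and the example scores are hypothetical and not taken from the paper.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, standard error, min, max) for a list of per-model scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Assumption: standard error = sample standard deviation / sqrt(n),
    # where the sample standard deviation uses the n - 1 denominator.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem, min(scores), max(scores)


def format_cell(scores):
    """Format scores in the table's 'mean±SE (min–max)' style, 3 decimals."""
    mean, sem, lo, hi = summarize(scores)
    return f"{mean:.3f}±{sem:.3f} ({lo:.3f}–{hi:.3f})"


# Hypothetical accuracies from four final models (illustrative only).
example_scores = [0.60, 0.62, 0.64, 0.66]
print(format_cell(example_scores))
```

Keeping the summary statistics in one function and the formatting in another makes it easy to swap in a different dispersion measure (for example, the standard deviation) if that turns out to be what was reported.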