Table 1. Comparison of our models with other published baselines.
Models | Batch size | Detecting modality | Loss | Optimizer | Learning rate | Acc., dev set (n = 500)* | AUROC, dev set (n = 500)* | Acc., test set (n = 1000)* | AUROC, test set (n = 1000)*
---|---|---|---|---|---|---|---|---|---
Image-Grid | 32 | Image | Cross entropy | AdamW | 1.00E-05 | 0.500±0.045 (0.436–0.536) | 0.516±0.027 (0.478–0.543) | 0.511±0.023 (0.478–0.526) | 0.514±0.018 (0.498–0.530)
Image-Region | 32 | Image | Cross entropy | AdamW | 5.00E-05 | 0.513±0.032 (0.484–0.548) | 0.549±0.030 (0.508–0.579) | 0.531±0.023 (0.502–0.558) | 0.561±0.039 (0.526–0.617)
Text BERT | 64 | Text | Cross entropy | AdamW | 5.00E-05 | 0.569±0.020 (0.548–0.588) | 0.625±0.047 (0.579–0.669) | 0.586±0.024 (0.556–0.612) | 0.639±0.006 (0.633–0.645)
Late Fusion | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.589±0.031 (0.544–0.612) | 0.641±0.040 (0.613–0.700) | 0.619±0.011 (0.608–0.630) | 0.679±0.018 (0.665–0.705)
ConcatBERT | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.576±0.038 (0.540–0.616) | 0.645±0.012 (0.629–0.655) | 0.622±0.023 (0.588–0.636) | 0.682±0.017 (0.659–0.696)
MMBT-Grid | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.603±0.042 (0.544–0.644) | 0.672±0.018 (0.654–0.696) | 0.631±0.014 (0.616–0.650) | 0.694±0.006 (0.687–0.700)
MMBT-Region | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.605±0.059 (0.524–0.652) | 0.649±0.067 (0.585–0.722) | 0.642±0.032 (0.608–0.672) | 0.690±0.046 (0.646–0.735)
ViLBERT | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.633±0.020 (0.612–0.656) | 0.717±0.035 (0.677–0.747) | 0.659±0.007 (0.652–0.668) | 0.732±0.015 (0.716–0.753)
Visual BERT | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.638±0.023 (0.612–0.668) | 0.722±0.010 (0.711–0.732) | 0.664±0.013 (0.656–0.684) | 0.748±0.011 (0.732–0.757)
ViLBERT CC | 32 | Image&Text | Cross entropy | AdamW | 1.00E-05 | 0.656±0.009 (0.648–0.668) | 0.730±0.035 (0.691–0.773) | 0.664±0.009 (0.652–0.674) | 0.739±0.016 (0.724–0.757)
Visual BERT COCO | 32 | Image&Text | Cross entropy | AdamW | 5.00E-05 | 0.648±0.032 (0.608–0.676) | 0.732±0.017 (0.711–0.752) | 0.664±0.020 (0.646–0.692) | 0.737±0.025 (0.711–0.770)
OSCAR+FC | 50 | Image&Tag&Text | Cross entropy | AdamW | 5.00E-06 | 0.666±0.038 (0.626–0.706) | 0.758±0.042 (0.703–0.803) | 0.677±0.010 (0.664–0.689) | 0.762±0.016 (0.749–0.786)
OSCAR+RF | 50 | Image&Tag&Text | Cross entropy | AdamW | 5.00E-06 | 0.667±0.034 (0.618–0.698) | 0.759±0.014 (0.745–0.777) | 0.684±0.002 (0.682–0.686) | 0.768±0.021 (0.737–0.784)
Footnotes: Acc., accuracy; AUROC, area under the receiver operating characteristic curve.
*Mean±standard error, with the range in parentheses, calculated from evaluations of four final models.
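As a rough illustration of how an entry in the format "mean±standard error (min–max)" could be produced from four evaluation runs, the sketch below may help. It is not the authors' code. The function names `summarize` and `format_summary` are hypothetical, the input scores are made up and are not values from Table 1, and the sketch assumes the standard error is the sample standard deviation (n − 1 denominator) divided by the square root of the number of runs, which the footnote does not specify.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, standard_error, minimum, maximum) for a list of scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation with an n - 1 denominator.
    sd = statistics.stdev(scores)
    # Assumption: standard error of the mean = sd / sqrt(n).
    se = sd / math.sqrt(n)
    return mean, se, min(scores), max(scores)


def format_summary(scores):
    """Format scores in the table's style: mean±SE (min–max)."""
    mean, se, lo, hi = summarize(scores)
    return f"{mean:.3f}±{se:.3f} ({lo:.3f}–{hi:.3f})"


# Hypothetical accuracies from four final models; NOT taken from Table 1.
example_runs = [0.60, 0.62, 0.64, 0.66]
print(format_summary(example_runs))  # 0.630±0.013 (0.600–0.660)
```

With only four runs the standard error is a coarse estimate, which is presumably why the table also reports the range.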