2023 Mar 22;10:158. doi: 10.1038/s41597-023-02036-y

Table 6.

Performance of the monomodal and multimodal approaches on MedVidCL test dataset.

| Group | Model | Precision | Recall | F1-score | Precision (Med-Inst) | Recall (Med-Inst) | F1-score (Med-Inst) |
|---|---|---|---|---|---|---|---|
| Monomodal (Language) | Linear SVC [30] | 89.64 | 89.71 | 88.41 | 99.76 | 70.33 | 82.50 |
| | SVM [16] | 89.54 | 88.73 | 87.42 | 100.0* | 67.00 | 80.24 |
| | BERT-Base-Uncased [31] | 92.82 | 93.23 | 92.91 | 95.98 | 87.50 | 91.54 |
| | RoBERTa-Base [32] | 94.58 | 94.98 | 94.67 | 97.99 | 89.33 | 93.46 |
| | BigBird-Base [17] | 95.58* | 95.96* | 95.68* | 98.19 | 90.67* | 94.28* |
| Monomodal (Vision) | I3D + LSTM [24,37] | 75.62 | 75.88 | 75.11 | 81.66 | 63.83 | 71.66 |
| | ViT + LSTM [33,37] | 82.07 | 81.16 | 80.49 | 89.62 | 67.67 | 77.11 |
| | I3D + Transformer [24,27] | 75.18 | 75.41 | 74.43 | 83.14 | 60.83 | 70.26 |
| | ViT + Transformer [27,33] | 81.76 | 82.06 | 81.26 | 89.25 | 69.17 | 77.93 |
| Multimodal (Language + Vision) | L + V (I3D) + LSTM | 75.96 | 76.16 | 75.68 | 79.68 | 66.67 | 72.60 |
| | L + V (ViT) + LSTM | 82.57 | 82.16 | 81.40 | 90.22 | 67.67 | 77.33 |
| | L + V (I3D) + Transformer | 74.74 | 75.10 | 74.80 | 76.23 | 69.50 | 72.71 |
| | L + V (ViT) + Transformer | 83.65 | 83.12 | 82.38 | 92.22 | 69.17 | 79.05 |

The results shown here are not a comparison among the models; rather, they illustrate the variety of models used to benchmark the dataset. Here, L and V denote Language and Vision, respectively. Precision, Recall, and F1-score are macro-averaged over all classes. The best results among the monomodal (language) approaches are highlighted with the * symbol; similarly, the best monomodal (vision) and multimodal results are each marked with their own symbol.
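As a minimal sketch of the macro averaging used in the table above: per-class precision, recall, and F1 are computed independently for each class and then averaged with equal weight, so a small class counts as much as a large one. The class labels and predictions below are hypothetical toy data, not values from the MedVidCL dataset.

```python
def macro_prf(y_true, y_pred, classes):
    """Macro-averaged precision, recall, and F1: compute each metric
    per class, then take the unweighted mean over all classes."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with three classes (abbreviated, hypothetical labels):
y_true = ["inst", "inst", "non-inst", "non-med", "non-med", "non-inst"]
y_pred = ["inst", "non-inst", "non-inst", "non-med", "inst", "non-inst"]
p, r, f = macro_prf(y_true, y_pred, ["inst", "non-inst", "non-med"])
```

This matches `sklearn.metrics.precision_recall_fscore_support(..., average="macro")`; the pure-Python version is shown only to make the averaging explicit.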