Table 6.
Performance of the monomodal and multimodal approaches on MedVidCL test dataset.
| Models | | Precision | Recall | F1-score | Precision (Med-Inst) | Recall (Med-Inst) | F1-score (Med-Inst) |
|---|---|---|---|---|---|---|---|
| Monomodal (Language) | Linear SVC30 | 89.64 | 89.71 | 88.41 | 99.76 | 70.33 | 82.50 |
| | SVM16 | 89.54 | 88.73 | 87.42 | 100.0* | 67.00 | 80.24 |
| | BERT-Base-Uncased31 | 92.82 | 93.23 | 92.91 | 95.98 | 87.50 | 91.54 |
| | RoBERTa-Base32 | 94.58 | 94.98 | 94.67 | 97.99 | 89.33 | 93.46 |
| | BigBird-Base17 | 95.58* | 95.96* | 95.68* | 98.19 | 90.67* | 94.28* |
| Monomodal (Vision) | I3D + LSTM24,37 | 75.62 | 75.88 | 75.11 | 81.66 | 63.83 | 71.66 |
| | ViT + LSTM33,37 | 82.07† | 81.16 | 80.49 | 89.62† | 67.67 | 77.11 |
| | I3D + Transformer24,27 | 75.18 | 75.41 | 74.43 | 83.14 | 60.83 | 70.26 |
| | ViT + Transformer27,33 | 81.76 | 82.06† | 81.26† | 89.25 | 69.17† | 77.93† |
| Multimodal (Language + Vision) | L + V (I3D) + LSTM | 75.96 | 76.16 | 75.68 | 79.68 | 66.67 | 72.60 |
| | L + V (ViT) + LSTM | 82.57 | 82.16 | 81.40 | 90.22 | 67.67 | 77.33 |
| | L + V (I3D) + Transformer | 74.74 | 75.10 | 74.80 | 76.23 | 69.50‡ | 72.71 |
| | L + V (ViT) + Transformer | 83.65‡ | 83.12‡ | 82.38‡ | 92.22‡ | 69.17 | 79.05‡ |
The results shown here are not intended as a direct comparison amongst the models; rather, they illustrate the variety of models used to benchmark the dataset. Here, L and V denote Language and Vision, respectively. Precision, Recall, and F1-score denote the macro average over all the classes. The best results amongst the monomodal (language) approaches are highlighted with the * symbol. Similarly, the best monomodal (vision) and multimodal results are marked with the † and ‡ symbols, respectively.
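The macro-averaged metrics reported in the table weight every class equally, regardless of how many examples it has. A minimal, illustrative implementation of that averaging is sketched below; this is not the evaluation code used for the benchmark, just one way the macro average over classes can be computed.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all classes.

    Per-class precision/recall/F1 are computed one-vs-rest, then
    averaged with equal weight per class (the macro average).
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # One-vs-rest counts for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

The same numbers can be obtained with `sklearn.metrics.precision_recall_fscore_support(..., average="macro")`; the pure-Python version above is shown only to make the averaging explicit.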