Table 7.
Performance of the best monomodal and multimodal approaches from Table 6 on the coarse-grained medical instructional videos of the MedVidCL dataset.
| | Models | mAP | F1 (macro) | F1 (micro) | F1 (weighted) |
|---|---|---|---|---|---|
| (1) | BigBird-Base | 57.53 | 47.77 | 60.32 | 60.2 |
| (2) | ViT + Transformer | 27.2 | 25.86 | 43.26 | 45.53 |
| (3) | L + V (ViT) + Transformer | 28.02 | 26.12 | 41.39 | 44.3 |
Here, (1), (2), and (3) denote the Monomodal (Language), Monomodal (Vision), and Multimodal (Language + Vision) models, respectively.