2023 Mar 22;10:158. doi: 10.1038/s41597-023-02036-y

Table 7.

Performance of the best monomodal and multimodal approaches from Table 6 on coarse-grained medical instructional videos of the MedVidCL dataset.

Models                          mAP    F1 (macro)  F1 (micro)  F1 (weighted)
(1) BigBird-Base                57.53  47.77       60.32       60.2
(2) ViT + Transformer           27.2   25.86       43.26       45.53
(3) L + V (ViT) + Transformer   28.02  26.12       41.39       44.3

Here (1), (2) and (3) represent the Monomodal (Language), Monomodal (Vision), and Multimodal (Language + vision) models, respectively.
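
The three F1 columns above average per-class results in different ways, which is why they can diverge for the same model. The following is a minimal sketch of the standard definitions for single-label classification, not the evaluation code used for this table. The function name `f1_scores` and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return macro, micro, and weighted F1 for single-label predictions.

    Illustrative helper (hypothetical name); follows the standard definitions.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)

    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: F1 computed from counts pooled over all classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class F1 averaged by the number of true samples per class.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Toy data (not from MedVidCL): class "a" is three times as frequent as "b".
scores = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(scores)
```

Because macro F1 gives every class the same weight, a model that does poorly on an infrequent class will typically show a macro score below its micro and weighted scores.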