. 2023 Mar 22;10:158. doi: 10.1038/s41597-023-02036-y

Table 7.

Performance of the best monomodal and multimodal approaches from Table 6 on the coarse-grained medical instructional videos of the MedVidCL dataset.

Models		mAP	F1 (macro)	F1 (micro)	F1 (weighted)
(1)	BigBird-Base	57.53	47.77	60.32	60.20
(2)	ViT + Transformer	27.20	25.86	43.26	45.53
(3)	L + V (ViT) + Transformer	28.02	26.12	41.39	44.30

Here, (1), (2), and (3) denote the monomodal (language), monomodal (vision), and multimodal (language + vision) models, respectively.
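
The three F1 columns differ only in how per-class scores are averaged, which may explain why macro F1 sits well below micro and weighted F1 for every model. The following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function name `f1_averages` and the toy labels `a`, `b`, `c` are hypothetical and are not drawn from MedVidCL.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return (macro, micro, weighted) F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    pairs = list(zip(y_true, y_pred))

    per_class = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_total += tp
        fp_total += fp
        fn_total += fn

    # Macro: unweighted mean over classes, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro_denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / micro_denom if micro_denom else 0.0
    # Weighted: mean over classes, weighted by the number of true examples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, micro, weighted


# Hypothetical, imbalanced toy labels (class "c" is rare and always missed).
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
macro, micro, weighted = f1_averages(y_true, y_pred)
print(f"macro={macro:.4f} micro={micro:.4f} weighted={weighted:.4f}")
```

In this toy case the missed rare class pulls the macro average down, while the micro and weighted averages are dominated by the frequent class. A similar gap in the table would be consistent with class imbalance, although the table itself does not report class frequencies.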