Table 6.
Performance of the monomodal and multimodal approaches on MedVidCL test dataset.
| Models | | Precision | Recall | F1-score | Precision (Med-Inst) | Recall (Med-Inst) | F1-score (Med-Inst) |
|---|---|---|---|---|---|---|---|
| Monomodal (Language) | Linear SVC30 | 89.64 | 89.71 | 88.41 | 99.76 | 70.33 | 82.50 |
| | SVM16 | 89.54 | 88.73 | 87.42 | 100.0* | 67.00 | 80.24 |
| | BERT-Base-Uncased31 | 92.82 | 93.23 | 92.91 | 95.98 | 87.50 | 91.54 |
| | RoBERTa-Base32 | 94.58 | 94.98 | 94.67 | 97.99 | 89.33 | 93.46 |
| | BigBird-Base17 | 95.58* | 95.96* | 95.68* | 98.19 | 90.67* | 94.28* |
| Monomodal (Vision) | I3D + LSTM24,37 | 75.62 | 75.88 | 75.11 | 81.66 | 63.83 | 71.66 |
| | ViT + LSTM33,37 | 82.07† | 81.16 | 80.49 | 89.62† | 67.67 | 77.11 |
| | I3D + Transformer24,27 | 75.18 | 75.41 | 74.43 | 83.14 | 60.83 | 70.26 |
| | ViT + Transformer27,33 | 81.76 | 82.06† | 81.26† | 89.25 | 69.17† | 77.93† |
| Multimodal (Language + Vision) | L + V (I3D) + LSTM | 75.96 | 76.16 | 75.68 | 79.68 | 66.67 | 72.60 |
| | L + V (ViT) + LSTM | 82.57 | 82.16 | 81.40 | 90.22 | 67.67 | 77.33 |
| | L + V (I3D) + Transformer | 74.74 | 75.10 | 74.80 | 76.23 | 69.50‡ | 72.71 |
| | L + V (ViT) + Transformer | 83.65‡ | 83.12‡ | 82.38‡ | 92.22‡ | 69.17 | 79.05‡ |
The results shown here are not intended as a direct comparison amongst the models; rather, they illustrate the variety of models used to benchmark the dataset. Here, L and V denote Language and Vision, respectively. Precision, Recall, and F1-score denote the macro average over all the classes. The best results amongst the monomodal (language) approaches are highlighted with the * symbol. Similarly, the best monomodal (vision) and multimodal results are marked with the † and ‡ symbols, respectively.
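The macro-averaged metrics reported in the table weight every class equally, regardless of how many examples it has. A minimal, illustrative implementation of that averaging is sketched below; this is not the evaluation code used for the benchmark, just one way the macro average over classes can be computed.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all classes.

    Per-class precision/recall/F1 are computed one-vs-rest, then
    averaged with equal weight per class (the macro average).
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # One-vs-rest counts for class c
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

The same numbers can be obtained with `sklearn.metrics.precision_recall_fscore_support(..., average="macro")`; the pure-Python version above is shown only to make the averaging explicit.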