2022 Dec 19;16:1086380. doi: 10.3389/fnins.2022.1086380

Table 4.

Unimodal audio/visual emotion recognition results with and without SEM.

Methods	Setting	IEMOCAP Avg. Acc	IEMOCAP Avg. F₁	CMU-MOSEI Avg. WA	CMU-MOSEI Avg. F₁
SER	w/o SEM	0.752	0.463	0.628	0.424
SER	w/ SEM	0.839	0.560	0.659	0.450
FER	w/o SEM	0.796	0.512	0.631	0.429
FER	w/ SEM	0.828	0.553	0.675	0.456

SER denotes speech emotion recognition, and FER denotes facial expression recognition. All frameworks follow the CNN-Transformer-MLP architecture; they differ only in whether SEM is used in the CNN encoder. Bold values indicate the best results.
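
To make the effect of SEM easier to read off Table 4, the absolute gain for each metric can be computed as the w/ SEM score minus the w/o SEM score. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `sem_gains` are illustrative assumptions and not part of the original work, while the numbers are copied directly from the table.

```python
# Scores copied from Table 4 as (w/o SEM, w/ SEM) pairs.
# The (task, metric) key layout is an assumption made for this sketch.
results = {
    ("SER", "IEMOCAP Avg. Acc"): (0.752, 0.839),
    ("SER", "IEMOCAP Avg. F1"): (0.463, 0.560),
    ("SER", "CMU-MOSEI Avg. WA"): (0.628, 0.659),
    ("SER", "CMU-MOSEI Avg. F1"): (0.424, 0.450),
    ("FER", "IEMOCAP Avg. Acc"): (0.796, 0.828),
    ("FER", "IEMOCAP Avg. F1"): (0.512, 0.553),
    ("FER", "CMU-MOSEI Avg. WA"): (0.631, 0.675),
    ("FER", "CMU-MOSEI Avg. F1"): (0.429, 0.456),
}


def sem_gains(table):
    """Return the absolute gain (w/ SEM minus w/o SEM), rounded to 3 decimals."""
    return {
        key: round(with_sem - without_sem, 3)
        for key, (without_sem, with_sem) in table.items()
    }


gains = sem_gains(results)
for (task, metric), gain in gains.items():
    print(f"{task} {metric}: {gain:+.3f}")
```

Under this reading, every entry improves when SEM is added, with the largest absolute gains on IEMOCAP for SER (+0.087 accuracy, +0.097 F₁).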