2022 Dec 19;16:1086380. doi: 10.3389/fnins.2022.1086380

Table 4.

Unimodal audio/visual emotion recognition results with and without SEM.

Methods          IEMOCAP                CMU-MOSEI
                 Avg. Acc   Avg. F1     Avg. WA   Avg. F1
SER   w/o SEM    0.752      0.463       0.628     0.424
SER   w/ SEM     0.839      0.560       0.659     0.450
FER   w/o SEM    0.796      0.512       0.631     0.429
FER   w/ SEM     0.828      0.553       0.675     0.456

SER refers to speech emotion recognition, and FER denotes facial expression recognition. All frameworks follow the CNN-Transformer-MLP architecture; the only difference is whether SEM is used in the CNN encoder. Bold values indicate the best results.
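The effect of SEM is easier to read as an absolute gain per metric. The following is a minimal sketch that recomputes those differences from the values reported in Table 4. The dictionary layout and the names `METRICS`, `RESULTS`, and `sem_gains` are illustrative assumptions, not part of the original work.

```python
# Values transcribed from Table 4. The data layout and all names below
# are hypothetical, chosen only for this illustration.
METRICS = (
    "IEMOCAP Avg. Acc",
    "IEMOCAP Avg. F1",
    "CMU-MOSEI Avg. WA",
    "CMU-MOSEI Avg. F1",
)

RESULTS = {
    "SER": {
        "w/o SEM": (0.752, 0.463, 0.628, 0.424),
        "w/ SEM": (0.839, 0.560, 0.659, 0.450),
    },
    "FER": {
        "w/o SEM": (0.796, 0.512, 0.631, 0.429),
        "w/ SEM": (0.828, 0.553, 0.675, 0.456),
    },
}


def sem_gains(results):
    """Return the absolute gain (w/ SEM minus w/o SEM) for each task and metric."""
    gains = {}
    for task, rows in results.items():
        without, with_sem = rows["w/o SEM"], rows["w/ SEM"]
        # Round to three decimals, the precision reported in the table.
        gains[task] = {
            metric: round(after - before, 3)
            for metric, before, after in zip(METRICS, without, with_sem)
        }
    return gains


if __name__ == "__main__":
    for task, deltas in sem_gains(RESULTS).items():
        for metric, delta in deltas.items():
            print(f"{task} {metric}: {delta:+.3f}")
```

Under these assumptions, every difference is positive, which matches the table: the w/ SEM row is the better one for both SER and FER on all four metrics.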