Table 4.
Task | Method | IEMOCAP Avg. Acc | IEMOCAP Avg. F1 | CMU-MOSEI Avg. WA | CMU-MOSEI Avg. F1
---|---|---|---|---|---
SER | w/o SEM | 0.752 | 0.463 | 0.628 | 0.424
SER | w/ SEM | 0.839 | 0.560 | 0.659 | 0.450
FER | w/o SEM | 0.796 | 0.512 | 0.631 | 0.429
FER | w/ SEM | 0.828 | 0.553 | 0.675 | 0.456
SER refers to speech emotion recognition, and FER denotes facial expression recognition. All frameworks follow the CNN-Transformer-MLP architecture; they differ only in whether SEM is used in the CNN encoder. Bold values indicate the best results.
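
To make the effect of SEM easier to read off Table 4, the short sketch below computes the absolute change in each metric when SEM is added. The scores are copied from the table; the names `TABLE4`, `METRICS`, and `sem_gains` are illustrative and not part of the original work.

```python
# Scores copied from Table 4, in the order:
# IEMOCAP Avg. Acc, IEMOCAP Avg. F1, CMU-MOSEI Avg. WA, CMU-MOSEI Avg. F1
METRICS = (
    "IEMOCAP Avg. Acc",
    "IEMOCAP Avg. F1",
    "CMU-MOSEI Avg. WA",
    "CMU-MOSEI Avg. F1",
)

TABLE4 = {
    "SER": {
        "w/o SEM": (0.752, 0.463, 0.628, 0.424),
        "w/ SEM": (0.839, 0.560, 0.659, 0.450),
    },
    "FER": {
        "w/o SEM": (0.796, 0.512, 0.631, 0.429),
        "w/ SEM": (0.828, 0.553, 0.675, 0.456),
    },
}


def sem_gains(table):
    """Return, per task, the absolute score change from adding SEM."""
    gains = {}
    for task, rows in table.items():
        without_sem = rows["w/o SEM"]
        with_sem = rows["w/ SEM"]
        # Round to three decimals to match the precision reported in the table.
        gains[task] = tuple(
            round(w - wo, 3) for wo, w in zip(without_sem, with_sem)
        )
    return gains


GAINS = sem_gains(TABLE4)

for task, deltas in GAINS.items():
    for name, delta in zip(METRICS, deltas):
        print(f"{task} {name}: {delta:+.3f}")
```

Every difference is positive, so in these results SEM improves both tasks on both datasets. The largest absolute gain is for SER on IEMOCAP.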