Sensors (Basel). 2021 Nov 18;21(22):7665. doi: 10.3390/s21227665

Table 1.

Quantitative evaluation of the different transfer-learning (TL) strategies on speech emotion recognition. The best model is shown in bold.

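| TL Strategy | Inputs | Models | With VAD (InaSpeech) | Accuracy ± 95% CI (%) |
|---|---|---|---|---|
| - | - | Human perception [18] | - | 67.00 |
| - | - | ZeroR | - | 13.33 ± 2.06 |
| Feature Extraction | Deep-Spectrum embs. from fc7 of AlexNet | SVC | No | 43.32 ± 2.56 |
| Feature Extraction | Deep-Spectrum embs. from fc7 of AlexNet | SVC | Yes | 45.80 ± 2.57 |
| Feature Extraction | PANNs embs. from CNN-14 | SVC | No | 39.73 ± 2.53 |
| Feature Extraction | PANNs embs. from CNN-14 | SVC | Yes | 37.22 ± 2.50 |
| Fine Tuning | Mel spectrograms | AlexNet | No | 60.72 ± 2.52 |
| Fine Tuning | Mel spectrograms | AlexNet | Yes | 61.67 ± 2.51 |
| Fine Tuning | Mel spectrograms | CNN-14 | No | **76.58 ± 2.18** |
| Fine Tuning | Mel spectrograms | CNN-14 | Yes | 75.25 ± 2.23 |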
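
To make the "± 95% CI" column easier to interpret, the following minimal sketch computes the half-width of a confidence interval for an accuracy using the normal (Wald) approximation for a binomial proportion. This table does not state how the intervals were computed or how many samples were evaluated, so both the choice of formula and the sample count `n_samples = 1440` are assumptions made purely for illustration; the function name `accuracy_ci_half_width` is likewise hypothetical.

```python
import math


def accuracy_ci_half_width(accuracy_pct: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width (in percentage points) of a normal-approximation
    confidence interval for an accuracy given in percent.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy_pct must lie in [0, 100]")
    p = accuracy_pct / 100.0
    # Standard error of a binomial proportion estimated from n_samples trials.
    standard_error = math.sqrt(p * (1.0 - p) / n_samples)
    return 100.0 * z * standard_error


if __name__ == "__main__":
    # ASSUMPTION: 1440 evaluated samples; this count is not given in the table.
    n_samples = 1440
    for accuracy in (43.32, 60.72, 76.58):
        half_width = accuracy_ci_half_width(accuracy, n_samples)
        print(f"{accuracy:.2f} ± {half_width:.2f}")
```

Under this approximation the interval narrows as the accuracy moves away from 50% and as the number of evaluated samples grows, so two models whose intervals do not overlap (for example, the fine-tuned CNN-14 and AlexNet rows) can be considered clearly separated.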