Table 3.
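| TL Strategy | Inputs | Models | With VAD (InaSpeech) | Accuracy ± 95% CI |
|---|---|---|---|---|
| - | - | Human perception [18] | - | 75.00 |
| - | - | ZeroR | - | 13.33 ± 2.06 |
| Feature Extraction (from pre-trained STN on AffectNet) | posteriors (7 classes) | Max. voting | No | 30.49 * ± 2.38 |
| Feature Extraction (from pre-trained STN on AffectNet) | posteriors (7 classes) | Max. voting | Yes | 30.35 * ± 2.37 |
| Feature Extraction (from pre-trained STN on AffectNet) | posteriors (7 classes) | Sequential (bi-LSTM) | No | 38.87 ± 2.52 |
| Feature Extraction (from pre-trained STN on AffectNet) | posteriors (7 classes) | Sequential (bi-LSTM) | Yes | 39.75 ± 2.53 |
| Feature Extraction (from pre-trained STN on AffectNet) | fc50 | Sequential (bi-LSTM) | No | 50.40 ± 2.58 |
| Feature Extraction (from pre-trained STN on AffectNet) | fc50 | Sequential (bi-LSTM) | Yes | 48.77 ± 2.58 |
| Feature Extraction (from pre-trained STN on AffectNet) | flatten-810 | Sequential (bi-LSTM) | No | 53.85 ± 2.57 |
| Feature Extraction (from pre-trained STN on AffectNet) | flatten-810 | Sequential (bi-LSTM) | Yes | 51.70 ± 2.58 |
| Fine-Tuning on RAVDESS | posteriors (8 classes) | Max. voting | No | 54.20 ± 2.56 |
| Fine-Tuning on RAVDESS | posteriors (8 classes) | Max. voting | Yes | 55.07 ± 2.56 |
| Fine-Tuning on RAVDESS | posteriors (8 classes) | Sequential (bi-LSTM) | No | 55.82 ± 2.56 |
| Fine-Tuning on RAVDESS | posteriors (8 classes) | Sequential (bi-LSTM) | Yes | 56.87 ± 2.56 |
| Fine-Tuning on RAVDESS | fc50 | Sequential (bi-LSTM) | No | 46.48 ± 2.58 |
| Fine-Tuning on RAVDESS | fc50 | Sequential (bi-LSTM) | Yes | 46.13 ± 2.57 |
| Fine-Tuning on RAVDESS | flatten-810 | Sequential (bi-LSTM) | No | 54.14 ± 2.57 |
| Fine-Tuning on RAVDESS | flatten-810 | Sequential (bi-LSTM) | Yes | 57.08 ± 2.56 |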
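
The "± 95% CI" margins in Table 3 can be read as half-widths of a confidence interval around each accuracy. The sketch below shows one common way to compute such a margin, the normal-approximation (Wald) interval for a proportion. Both the choice of this interval and the sample count `N_ASSUMED = 1440` are assumptions made for illustration; neither is stated in this section, and the function name `ci_half_width` is hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def ci_half_width(accuracy_pct: float, n_samples: int, z: float = Z_95) -> float:
    """Half-width (in percentage points) of a Wald interval for an accuracy.

    accuracy_pct: accuracy in percent, e.g. 50.40
    n_samples:    number of evaluated samples (assumed, not given in the table)
    """
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_samples)


# ASSUMPTION: number of evaluated samples; not stated in this section.
N_ASSUMED = 1440

print(f"{ci_half_width(50.40, N_ASSUMED):.2f}")  # prints 2.58
```

Under these assumptions the margin for the fc50 feature-extraction row without VAD (50.40) comes out at 2.58, which agrees with the value in the table. Because the margin shrinks as the accuracy moves away from 50%, rows with similar accuracies share nearly identical margins.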