2021 Nov 18;21(22):7665. doi: 10.3390/s21227665

Table 3.

Quantitative evaluation of the different strategies on the facial emotion recognizer. Results are given at the video level. All results are reported on eight emotions, except those marked with (*), which are reported on seven emotions because the ‘Neutral’ and ‘Calm’ emotions are collapsed into one class. The best model is shown in bold.

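TL Strategy	Inputs	Models	With VAD (InaSpeech)	Accuracy (%) ± 95% CI
-	-	Human perception [18]	-	75.00
-	-	ZeroR	-	13.33 ± 2.06
Feature Extraction (from pre-trained STN on AffectNet)	posteriors (7 classes)	Max. voting	No	30.49* ± 2.38
Feature Extraction (from pre-trained STN on AffectNet)	posteriors (7 classes)	Max. voting	Yes	30.35* ± 2.37
Feature Extraction (from pre-trained STN on AffectNet)	posteriors (7 classes)	Sequential (bi-LSTM)	No	38.87 ± 2.52
Feature Extraction (from pre-trained STN on AffectNet)	posteriors (7 classes)	Sequential (bi-LSTM)	Yes	39.75 ± 2.53
Feature Extraction (from pre-trained STN on AffectNet)	fc50	Sequential (bi-LSTM)	No	50.40 ± 2.58
Feature Extraction (from pre-trained STN on AffectNet)	fc50	Sequential (bi-LSTM)	Yes	48.77 ± 2.58
Feature Extraction (from pre-trained STN on AffectNet)	flatten-810	Sequential (bi-LSTM)	No	53.85 ± 2.57
Feature Extraction (from pre-trained STN on AffectNet)	flatten-810	Sequential (bi-LSTM)	Yes	51.70 ± 2.58
Fine-Tuning on RAVDESS	posteriors (8 classes)	Max. voting	No	54.20 ± 2.56
Fine-Tuning on RAVDESS	posteriors (8 classes)	Max. voting	Yes	55.07 ± 2.56
Fine-Tuning on RAVDESS	posteriors (8 classes)	Sequential (bi-LSTM)	No	55.82 ± 2.56
Fine-Tuning on RAVDESS	posteriors (8 classes)	Sequential (bi-LSTM)	Yes	56.87 ± 2.56
Fine-Tuning on RAVDESS	fc50	Sequential (bi-LSTM)	No	46.48 ± 2.58
Fine-Tuning on RAVDESS	fc50	Sequential (bi-LSTM)	Yes	46.13 ± 2.57
Fine-Tuning on RAVDESS	flatten-810	Sequential (bi-LSTM)	No	54.14 ± 2.57
Fine-Tuning on RAVDESS	flatten-810	Sequential (bi-LSTM)	Yes	57.08 ± 2.56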
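
The ± values in the accuracy column are half-widths of 95% confidence intervals. The table does not say how they were computed, so the following is only a minimal sketch, assuming the usual normal-approximation (Wald) interval for a proportion and an evaluation set of 1440 videos. Both the formula and the sample count are assumptions, and the function name `wald_ci_half_width` is hypothetical. Under these assumptions the sketch gives the half-widths listed for several rows (for example 50.40 ± 2.58 and 57.08 ± 2.56), but it is not guaranteed to match every row: the ZeroR interval, for instance, comes out differently.

```python
import math


def wald_ci_half_width(accuracy_pct: float, n_samples: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation CI.

    accuracy_pct: accuracy expressed as a percentage (0 to 100).
    n_samples:    number of evaluated videos (assumed, not given in the table).
    z:            critical value; 1.96 corresponds to a 95% interval.
    """
    p = accuracy_pct / 100.0
    # Standard error of a binomial proportion, scaled by the critical value.
    half_width = z * math.sqrt(p * (1.0 - p) / n_samples)
    return 100.0 * half_width


if __name__ == "__main__":
    N_VIDEOS = 1440  # assumption: size of the evaluation set
    for acc in (30.49, 50.40, 57.08):
        print(f"{acc:.2f} ± {wald_ci_half_width(acc, N_VIDEOS):.2f}")
```

Because the half-width depends on p(1 - p), it peaks near 50% accuracy, which is why most rows in the table share a value close to 2.58.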