Introducing a novel dataset for facial emotion recognition and demonstrating significant enhancements in deep learning performance through pre-processing techniques

Nursel Yalçin; Muthana Alisawi

doi:10.1016/j.heliyon.2024.e38913

. 2024 Oct 4;10(20):e38913. doi: 10.1016/j.heliyon.2024.e38913

Introducing a novel dataset for facial emotion recognition and demonstrating significant enhancements in deep learning performance through pre-processing techniques

Nursel Yalçin ^a, Muthana Alisawi ^b,^c,^⁎

PMCID: PMC11620061 PMID: 39640693

Abstract

Facial expression recognition (FER) plays a pivotal role in various applications, ranging from human-computer interaction to psychoanalysis. To improve the accuracy of facial emotion recognition (FER) models, this study focuses on enhancing and augmenting FER datasets. It comprehensively analyzes the Facial Emotion Recognition dataset (FER13) to identify defects and correct misclassifications. The FER13 dataset represents a crucial resource for researchers developing Deep Learning (DL) models aimed at recognizing emotions based on facial features. Subsequently, this article develops a new facial dataset by expanding upon the original FER13 dataset. Similar to the FER + dataset, the expanded dataset incorporates a wider range of emotions while maintaining data accuracy. To further improve the dataset, it will be integrated with the extended Cohn-Kanade (CK+) dataset.

This paper investigates the application of modern DL models to enhance emotion recognition in human faces. By training a new dataset, the study demonstrates significant performance gains compared with its counterparts. Furthermore, the article examines recent advances in FER technology and identifies critical requirements for DL models to overcome the inherent challenges of this task effectively. The study explores several DL architectures for emotion recognition in facial image datasets, with a particular focus on convolutional neural networks (CNNs). Our findings indicate that complex architecture, such as EfficientNetB7, outperforms other DL architectures, achieving a test accuracy of 78.9 %. Notably, the model surpassed the EfficientNet-XGBoost model, especially when used with the new dataset. Our approach leverages EfficientNetB7 as a backbone to build a model capable of efficiently recognizing emotions from facial images. Our proposed model, EfficientNetB7-CNN, achieved a peak accuracy of 81 % on the test set despite facing challenges such as GPU memory limitations. This demonstrates the model's robustness in handling complex facial expressions. Furthermore, to enhance feature extraction and attention mechanisms, we propose a new hybrid model, CBAM-4CNN, which integrates the convolutional block attention module (CBAM) with a custom 4-layer CNN architecture. The results showed that the CBAM-4CNN model outperformed existing models, achieving higher accuracy, precision, and recall metrics across multiple emotion classes. The results highlight the critical role of comprehensive and diverse data in enhancing model performance for facial emotion recognition.

Keywords: Deep learning architectures, Facial emotion recognition FER, FER13, Extended Cohn-Kanade (CK+), Convolutional neural network CNN, Features extraction, Attention mechanisms

1. Introduction

The problem of recognizing human emotions based on facial features has long been of interest to psychologists, neuroscientists, and computer scientists, as it is the foundation of effective social interactions and communications. Recently, after the development of DL techniques, there was a breakthrough in the field of emotion recognition based on facial images [1]. DL techniques have shown remarkable performance in detecting emotions through facial features by simulating the structure and functioning of the human brain. Moreover, DL architectures can learn from complex patterns, extract features from large data sets, and generalize learning capabilities to new data [2]. Recognizing emotions through facial features using DL techniques has emerged as a promising research area. Its applications span across robotics, mental health applications, and human-computer interaction [3].

A trained observer can generally recognize facial expressions consistently and nearly instantly [4]. Conversely, On the other hand, the interpretation of such emotional expressions by automatic systems is generally complex and challenging and still has many unanswered issues which demand vast research-effort [5].

FER has applications in various disciplines, including medicine, social sciences, automotive and consumer electronics, human-machine interaction, and human-robot interaction [6]. Facial emotion recognition is a multi-step process. Initially, a face image is acquired through a live or recorded camera. Then, segmenting the face from the image. Subsequently, the detected face undergoes normalization to remove any distortions [7].

Traditional FER approaches heavily rely on content-based methods. These methods typically employ mathematical features, templates, or classifiers based on hand-crafted features and various learning methods. These methods manually extract facial features such as the eyes, nose, and mouth. These features are often employed in conjunction with supervised learning approaches, including support vector machines (SVMs) and decision trees (DTs). Additionally, Gabor wavelets and histograms are often used for feature extraction from facial images, which are then fed into classifiers. Although these methods are simple to implement, they typically achieve low accuracy [8,9].

Convolutional neural networks (CNNs) have made significant advancements in the field of FER tasks. However, these methods using CNNs don't really capture the complex and important features needed to tell the difference between different facial expressions from a wide angle. Therefore, there is still ample opportunity to enhance the performance of current CNN models for facial expression recognition (FER) [10].

In the early days, the psychological models were pioneering works in the recognition of FER, especially Paul Ekman's six basic emotions [11]. FER has continued to be an active area of research in psychology and related fields. Key research focuses have included facial action units (FACS) proposed by Ekman and Friesen, facial geometry kinetics (FGK), and the facial action coding system created by Cohen, Kanade, and Cohn. Additionally, other artificial intelligence techniques like fuzzy logic, hidden Markov models (HMMs), neural networks, and SVMs were utilized. These focus on making the gaps between categories bigger, as opposed to machine learning algorithms like Lazy K-Star [12], which look for similarities between cases. Other combinations of these techniques are also being used. These techniques have shown promising results in improving the accuracy and robustness of FER systems [8].

To address the limitations of the current FER methods, we propose a novel approach combining CBAM, 4-stage CNN, and EfficientNetB7-CNN. Then, we accurately evaluate these models using a novel dataset and pre-processing methods. Specifically, we examine contemporary DL methods that use facial features to identify emotions and investigate how our newly introduced dataset influences these methods. The core contributions of this work are threefold:

•
Introduce the balanced FER2024_CK + dataset, which combines, preprocesses, and enhances existing datasets to improve performance and reliability.
•
Evaluate various DL models on this dataset, including our proposed models (CBAM-4CNN and EfficientNetB7-CNN), highlighting their strengths and weaknesses.
•
Provide a comprehensive cost-efficiency analysis of the proposed models, demonstrating significant enhancements in performance.

The article is structured as follows. Section 2 introduces some of the important related works. Our proposed methodology is detailed in Section 3.1, which is part of the border materials and methods section (Section 3). Experimental results are presented in Section 4 and followed by a detailed discussion in Section 5. Section 6 concludes the study and identifies our primary future direction.

2. Related works

Computer vision is a rapidly growing field that combines psychology, AI, and human-computer interaction. Recently, significant advancements have been made to develop systems capable of accurately recognizing human emotions facial expressions. This interdisciplinary field has practical implications for human-computer interaction, healthcare, and emotionally intelligent technologies.

Nawaf and Jasim [13] proposed a FER system using a CNN algorithm based on VGGNet. The model was trained on the FER2013 and FER + datasets, which were augmented to include additional images. The model validated its effectiveness in recognizing human emotion through a mean accuracy of 79 %. Also, the authors in Ref. [14] suggested an Emotion Recognition Convolutional Neural Networks (ERCNN) model designed specifically for identifying human emotions. Compared to the pre-trained models, ERCNN demonstrated its superiority in terms of accuracy, speed, and overall effectiveness. The ERCNN model achieved an accuracy of 87.133 % (82.648 %) in the public (private) test.

Punuri et al. [15] presented a new strategy derived from the Transfer Learning (TL) approach called EfficientNet-XGBoost. EfficientNet-XGBoost model integrates the strength of the EfficientNet and XGBoost algorithms. The authors demonstrated its superiority over the originality of the approach. In order to expedite the learning process of the network and address the issue of the vanishing gradient, they incorporated fully connected layers that utilize global average pooling, dropout, and dense operations. EfficientNet is optimized by substituting the higher dense layer(s) and integrating the XGBoost classifier, rendering it appropriate for FER. The suggested method has been thoroughly validated on four benchmark datasets: CK+, KDEF, JAFFE, and FER2013. To address the problem of data imbalance in certain datasets, including CK+ and FER2013, artificial data augmentation was employed using geometric modification techniques. Regardless of the characteristics of the datasets, the suggested strategy outperformed the counterparts, achieving accuracy rates of 100 %, 98 %, and 98 % for the first three datasets, respectively. However, the effectiveness of the proposed study is not as promising when trained and tested via FER2013 datasets (72.54%). Gupta et al. [16] introduced modified Inception-V3, VGG19, and RESENT50. To evaluate the models, the authors developed using three datasets: FER-2013, CK+, and RAF-DB. Proposed + ResNet-50 achieved the best performance of 73 %, 89 % and 76 %, respectively.

Choi and Lee [17] proposed a Deep Convolutional Neural Network (DCNN) ensemble classifier to enhance the recognition of facial expressions in uncontrolled environments. The approach employed a stochastic optimization technique to determine the weights of the ensemble, with the objective of minimizing energy and producing individual members. The DCNN ensemble classifier demonstrates competitive FER performance based on experiments conducted on three wild FER datasets (FER2013, SFEW2.0, and RAF-DB) and got an accuracy of 76.69 %, 58.68 %, and 87.13 %, respectively. Finally, Table 1 summarizes the related works mentioned in this section.

Table 1.

Comparative review of related works.

Study	Year	Architecture	Used Dataset	Val_accuracy
[17]	2021	DCNN	FER2013, SFEW2.0, and RAF-DB	76.69 %, 58.68 %, and 87.13 %, respectively
[13]	2022	CNN based on VGGNet	FER2013, FER+	79 %
[18]	2022	3 stage CNN	FER2013	82 %
[14]	2023	ERCNN	FER2013, FER+	82.64 %
[15]	2023	EfficientNet-XGBoost	CK+ and FER2013	100, 72.5 % respectively
[10]	2023	FER-CHC	FER2013	74.68 %
[16]	2023	Proposed + ResNet-50	FER-2013, CK+, RAF-DB	73 %, 89 % and 76 %, respectively.
[19]	2023	SSF-ViT (L)	FER2013	74.95 %
[20]	2023	CNN-based Inception-v3 architecture	FER2013	73.09 %
[21]	2023	Xception Net	FER2013	77.92 %
[22]	2023	EmoNAS	FER2013	67.9 %
[23]	2023	SSA-NET	FER2013	67.57 %
[24]	2024	EduViT based on the MobileViT architecture	FER2013	66.51 %
[25]	2024	Hybridized CNN-LSTM	FER2013	79.34 %
[26]	2024	Activation-matrix Triplet loss and pseudo label with Complementary Information	FER2013	71.62 %
[27]	2024	EfficientNet	FER2013	58.41 %
[28]	2024	Custom CNN	FER2013	57.4 %

	Surprise	Fear	Angry	Neutral	Sad	Disgust	Happy	Total
Training	3171	4097	3995	4965	4830	436	7215	28709
Testing	831	1024	958	1233	1247	111	1774	7178

Emotion	Angry	Disgust	Fear	Happy	Neutral	Sad	Surprise	Contempt	Confused	Sleepy	Not Face
Angry	3625	15	27	20	55	119	22	46	52	0	14
Disgust	9	396	4	0	0	7	3	16	0	0	1
Fear	376	1	2729	51	267	115	60	42	447	0	9
Happy	5	0	5	7121	28	12	21	1	2	0	20
Neutral	12	1	14	206	4420	136	7	14	18	122	17
Sad	124	5	38	18	111	4479	6	28	11	0	7
Surprise	10	2	14	25	9	4	3097	1	0	0	9
New Total	4161	420	2831	7441	4890	4872	3216	148	530	122	77

	Angry	Disgust	Fear	Happy	Neutral	Sad	Surprise	Contempt	Confused	Sleepy	Total
FER2024	5121	531	3850	9237	6034	6118	4043	172	538	148	35878
CK+	135	177	75	207	0	84	249	54	0	0	981
Total	5256	708	3925	9444	6034	6202	4292	226	538	148	36859

	Angry	Disgust	Fear	Happy	Neutral	Sad	Surprise	Contempt	Confused	Sleepy
Training 70 %	3680	496	2747	6612	4224	4342	3006	158	376	104
Private Testing 15 %	788	106	589	1416	905	930	643	34	81	22
Public Testing 15 %	788	106	589	1416	905	930	643	34	81	22

Emotion	Angry	Disgust	Fear	Happy	Neutral	Sad	Surprise	Contempt	Confused	Sleepy	NHF	Unclear	Eyeglasses
Angry	2728	1	24	12	14	24	15	59	0	0	43	726	34
Disgust	0	319	0	0	0	0	0	0	0	0	0	0	0
Fear	52	0	1251	9	1	36	99	5	0	0	66	1202	27
Happy	5	0	1	6080	1	0	16	2	0	0	34	358	115
Neutral	5	0	0	94	2638	0	3	7	0	7	0	1412	58
Sad	51	0	9	13	60	2182	6	66	0	0	58	1877	18
Surprise	19	0	41	95	10	7	2262	0	0	0	30	485	57
Contempt	0	0	0	0	0	16	0	186	0	0	0	1	4
Confused	0	0	0	0	0	0	0	0	376	0	0	0	0
Sleepy	0	0	0	0	0	0	0	0	0	104	0	0	0
New Total	2860	320	1326	6303	2724	2265	2401	325	376	111	231	6061	313

	Angry	Disgust	Fear	Happy	Neutral	Sad	Surprise	Contempt	Confused	Sleepy
Training 80 %	3292	494	1466	7141	3108	3043	2829	347	430	128
Private Testing 10 %	411	62	183	911	389	380	353	44	54	16
Public Testing 10 %	411	62	183	911	389	380	353	44	54	16
Total	4114	618	1832	8963	3886	3803	3535	435	538	160

Model	Image size	Image type	Epoch	Batch size	Accuracy
Model	Image size	Image type	Epoch	Batch size	FER13	FER24_CK+ 7 Emotions	FER24_CK+ 10 Emotions
VGG16	48 × 48	Grayscale	20	64	24.7 %	33.8 %	32.5 %
AlexNet	48 × 48	Grayscale	20	32	24.7 %	33.8 %	32.5 %
ResNet101	224 × 224	Grayscale	10	32	25.1 %	34 %	32.8 %
ResNet152	224 × 224	Grayscale	10	32	29.6 %	37.6 %	35.9 %
ResNet50	224 × 224	Grayscale	20	64	32.3 %	37.6 %	36.3 %
Standard CNN	48 × 48	Grayscale	50	64	37.4 %	45 %	42.2 %
InceptionV3	128 × 128	Grayscale	20	32	41.7 %	51.9 %	49.3 %
MobileNetV2	224 × 224	Grayscale	20	64	43.2 %	55.9 %	53.4 %
DenseNet121	48 × 48	Color	20	64	45.5 %	54.2 %	51.7 %
VGG19	48 × 48	Grayscale	20	64	56 %	67 %	62.6 %
Xception	71 × 71	Grayscale	20	32	62.6 %	69.9 %	69.5 %
EfficientNetB0	224 × 224	Grayscale	10	32	63.7 %	74.1 %	70.9 %
InceptionResNetV2	299 × 299	Grayscale	20	32	64.5 %	76.3 %	73.6 %
DenseNet169	224 × 224	Grayscale	20	32	65.4 %	78.4 %	74.2 %
EfficientNetB7	224 × 224	Grayscale	20	32	69.2 %	78.9 %	76.1 %

Rotation	Value	Zoom_random	Value	Flip_left_right	Value
Probability	0.7	Probability	0.5	Probability	0.5
Max_left_rotation	10	Percentage_area	0.8
Max_right_rotation	10

Stage	CPU Usage (%)	Memory Usage (%)	Time Taken (s)
Initial	0	4.2
Visualization Before Balancing	0.5	4.2	2.96439
Sample Display Before Enhancement	0.7	4.2	5.48402
Dataset Augmenting (10 emotions)	15.3	4.5	296.95
Visualization After Balancing	0.7	4.5	298.707
Sample Display After Enhancement	0.7	4.5	301.132
Zipping Dataset	2	4.5	14.6526

Stage	CPU Usage (%)	Memory Usage (%)	Time Taken (s)
Initial	0.5	4.3
Visualization Before Balancing	0.5	4.3	2.92086
Sample Display Before Enhancement	3.5	4.3	5.21949
Dataset Augmenting (7 emotions)	9.8	4.4	164.799
Visualization After Balancing	0.5	4.4	166.406
Sample Display After Enhancement	2	4.4	168.699
Zipping Dataset	0.3	4.5	10.6167

Backbone	Method name	Year	Accuracy	Val_accuracy
Pre-trained Models	Ours (EfficientNetB7-CNN)	–	93.66 %	78.72 %
	EfficientNet-XGBoost [15]	2023	Not reported	72.5 %
	Proposed + ResNet-50 [16]	2023	Not reported	73 %
	CNN-based Inception-v3 [20]	2023	Not reported	73.09 %
	Xception Net [21]	2023	Not reported	77.92 %
	FER-CHC [10]	2023	Not reported	74.68 %
	ResNet50 [27]	2024	59.41 %	54.67 %
	VGGNET [27]	2024	50.31 %	51.11 %
	EfficientNet [27]	2024	62.15 %	58.41 %
	SSER [26]	2024	Not reported	71.62 %
	ResNet50-CBAM-TCN [54]	2024	91 %	Not reported
	EduViT based MobileViT [24]	2024	Not reported	66.51 %
CNN based Models	Ours (CBAM-4CNN)	–	87.75 %	77.48 %
	Custom CNN [28]	2024	Not reported	57.4 %
	Hybridized CNN-LSTM [25]	2024	79.34 %	Not reported
	DCNN [27]	2024	82.56 %	65.68 %
	SSA-NET [23]	2023	Not reported	67.57 %
	EmoNAS [22]	2023	Not reported	67.9 %
	SSF-ViT(L) [19]	2023	74.95 %	71.7 %
	3 stage CNN [18]	2022	82 %	Not reported
	DCNN [17]	2021	Not reported	76.69 %

PERMALINK

Introducing a novel dataset for facial emotion recognition and demonstrating significant enhancements in deep learning performance through pre-processing techniques

Nursel Yalçin

Muthana Alisawi

Abstract

1. Introduction

2. Related works

Table 1.

3. Materials and methods

3.1. Proposed method

Fig. 1.

3.1.1. EfficientnetB7_CNN model

Fig. 2.

3.1.2. Convolutional block attention Module-4CNN (CBAM-4CNN) model

Fig. 3.

3.2. Datasets

Chart 1.

Table 2.

Fig. 4.

Table 3.

Chart 2.

Fig. 5.

Table 4.

Chart 3.

3.3. FER2013 preprocessing

Fig. 6.

Table 5.

Table 6.

3.4. New dataset FER2024

Table 7.

3.5. Dataset combination

Table 8.

Table 9.

Fig. 7.

Fig. 8.

Table 10.

Table 11.

Table 12.

3.6. Deep learning architectures

4. Experimental results

Fig. 9.

Fig. 10.

4.1. Performance evaluation of different DL models

Table 13.

Table 14.

4.2. Augmentation and enhancement

4.2.1. FER24_CK + augmentation

Table 15.

Fig. 11.

Fig. 12.

Table 16.

Table 17.

4.3. Transfer learning EfficientNetB7-CNN

Table 18.

Fig. 13.

Table 19.

4.4. CBAM-4CNN model

Table 20.

Fig. 14.

Table 21.

5. Discussion

Table 22.

6. Conclusion

Funding

Ethical approval

Data and code availability

CRediT authorship contribution statement

Declaration of competing interest

Acknowledgments

Contributor Information

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases