Abstract
Goal: As an essential human-machine interaction task, emotion recognition has attracted increasing attention over the past decades. Although previous attempts to classify emotions have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions using different modalities, and 2) given the increasing amount of computing power required by deep learning, how to provide real-time detection and improve the robustness of deep neural networks. Method: In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which can adaptively integrate the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve the performance of MER. Specifically, the proposed Deep-Emotion framework consists of three branches, i.e., the facial branch, the speech branch, and the EEG branch. The facial branch uses the improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, we propose a lightweight fully convolutional neural network (LFCNN) for the efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, resulting in more comprehensive and accurate performance. Result and Conclusions: Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the advanced nature of the proposed Deep-Emotion method, as well as the feasibility and superiority of the MER approach.
Keywords: Multimodal emotion recognition, electroencephalogram, facial expressions, speech
I. Introduction
Human emotions can be understood as people's attitudes, experiences, and corresponding behavioral responses to the objective environment [1], [2], [3]. Emotions play an essential role in people's daily lives and work [4]. With the rapid development of multimedia and human-computer interaction applications, intelligent machines with emotion recognition have been widely used in medical assistance [5], driving safety [6], and other fields. The definition of emotion can be divided into two paradigms, i.e., the discrete paradigm and the multi-dimensional paradigm. The discrete paradigm refers to the categories of emotions that people describe in daily life, such as happiness, anger, and depression. In contrast, the most commonly used multi-dimensional paradigm is the arousal-valence 2D model proposed by Russell [7], in which arousal and valence denote the levels of excitement and positivity, respectively; this definition facilitates quantitative research on emotions. The ways of expressing emotions can be broadly classified into two categories: external representations, such as facial expressions and speech, and internal representations, such as electroencephalography (EEG) and heart rate [1].
Over the past decade, the majority of emotion recognition studies have focused on unimodal emotion recognition (UER) using only one modality [8], [9], [10]. However, emotions are a complex representation that cannot be reliably captured with unimodal signals, since genuine emotion can be hidden behind different facial expressions or tones [11]. Even so, facial expressions and speech remain the dominant external channels for conveying emotion. One study [12] showed that these two modalities account for 93% of the emotional information in human communication, and they are critical for multimodal emotion recognition (MER) using external channels [13]. Some recent studies [14], [15], [16] have attempted to leverage multiple modalities to boost the performance of emotion recognition, demonstrating the complementarity of emotional information across modalities. However, these fusion strategies still lack internal representation modalities, and their reliability needs to be improved. A previous study [17] proposed that fusing facial expressions, speech, and EEG could be a promising direction for future research in emotion recognition. Inspired by this, on the basis of MER that integrates facial expressions and speech, this paper introduces EEG, a signal that is not subject to individual subjective will, to improve the reliability of the emotion recognition method [11].
EEG signals, which originate from the central nervous system, can reflect changes in people's emotions more accurately and objectively than other signals [2], [11]. As mentioned above, emotions can be expressed in multiple dimensions, and facial expressions and speech, as the most important external representations of emotions [12], should also be considered. This paper uses the three modalities of facial expressions, speech, and EEG to study MER for the first time. Unlike previous MER methods, the proposed method considers the three most relevant external and internal representations of emotions, yielding better accuracy and reliability.
For multimodal fusion, fusion methods can be divided into feature-level fusion and decision-level fusion [18]. In previous studies [19], [20], [21], decision-level fusion methods were not only easy to implement but also exhibited better performance than feature-level fusion. For example, the winning methods of the EmotiW challenge were almost all based on decision-level fusion [14]. However, adding modalities in decision-level fusion means that multiple models must be designed, which can make multimodal recognition algorithms inefficient, difficult to port to mobile devices, and hard to run in real time in daily use [22]. This implies that we need to pay attention to model size when designing emotion recognition models rather than just striving for accuracy.
With the improvement of chip computing power and deep learning performance, many novel emotion recognition methods have emerged in recent years. Some mainstream neural network models have achieved good results in emotion recognition, such as CNNs [4], LSTMs [8], [20], DBNs [23], and GCNs [24]. These deep learning methods have gradually replaced traditional feature extraction methods as the primary research methods for emotion recognition. The Deep-Emotion recognition framework proposed in this paper utilizes three deep learning models to extract emotional features from facial expressions, speech, and EEG, respectively. The decision-level fusion method is then applied to integrate the recognition results from each modality, resulting in more comprehensive and accurate recognition. In addition, to prevent the final model from becoming too large due to the number of classification models, we reduced the number of model parameters as much as possible while maintaining the classification accuracy of each model. Our contributions in this paper can be summarized as follows:
• This study is the first attempt to combine the multiple modalities of facial expressions, speech, and EEG for emotion recognition. In the decision-level fusion stage, we propose an optimal weight distribution algorithm. Compared with traditional equal-weight fusion, this method can better judge the reliability of each modality and thus effectively enhance the fusion performance.
• In this paper, a carefully improved GhostNet [25] structure is proposed for facial expression recognition (FER). This method effectively alleviates the overfitting of the original GhostNet during training and effectively improves classification accuracy.
• For speech emotion recognition (SER), we design a lightweight fully convolutional neural network (LFCNN), which achieves good feature learning performance with only a few parameters. Since decision-level fusion requires multiple classifiers, reducing the number of model parameters as much as possible is also an important consideration in the model design.
• For EEG emotion recognition (EER), this paper designs a tree-like LSTM (tLSTM) model that can fuse multi-stage features. This model combines shallow and deep features during feature extraction and thus achieves better performance.
The rest of this paper is organized as follows: Section II describes the proposed emotion recognition methods and related experiments. Next, the comprehensive experimental results and discussion are reported in Section III. Finally, Section IV presents the conclusion of this work.
II. Materials and Methods
A. Data Preprocessing
The method of data preprocessing can be found in the Supplementary Materials of this manuscript.
B. Deep Learning Model
1). The Improved GhostNet for FER
GhostNet is mainly composed of multi-layer Ghost bottlenecks, which in turn are built from the Ghost module. The structures of the Ghost module and the Ghost bottleneck are shown in Fig. 2 and Fig. 1(a), respectively. Our work focuses on improving the Ghost bottleneck architecture, as shown in Fig. 1(b). The original Ghost bottleneck has two modes, stride = 1 and stride = 2, which perform feature extraction at different scales and produce feature maps of different sizes. However, feature maps of different sizes each carry information that is valuable for subsequent feature extraction. Inspired by this, the Ghost bottleneck proposed in this paper combines the characteristics of these two modes to provide more comprehensive features. Specifically, a downsampling convolution is introduced into the original stride = 1 path so that its output has the same shape as that of the stride = 2 path. For the input shape used in our experiments, the specific structure of the improved GhostNet is shown in Table I.
TABLE I. Structure of the Proposed Improved GhostNet Model for FER. EXP: Expansion Size. OUT: The Number of Output Channels. SE: Whether Using the SE Module.
| Operator | Output shape | EXP | OUT | SE |
| --- | --- | --- | --- | --- |
| Conv2d, 16 | (batch, 24, 24, 16) | - | - | - |
| Ghost_bottleneck | (batch, 12, 12, 40) | 120 | 40 | True |
| Dropout, 0.3 | (batch, 12, 12, 40) | - | - | - |
| Ghost_bottleneck | (batch, 6, 6, 80) | 240 | 80 | False |
| Dropout, 0.3 | (batch, 6, 6, 80) | - | - | - |
| Ghost_bottleneck | (batch, 3, 3, 160) | 672 | 160 | True |
| Dropout, 0.3 | (batch, 3, 3, 160) | - | - | - |
| Ghost_bottleneck | (batch, 2, 2, 160) | 960 | 160 | False |
| Dropout, 0.3 | (batch, 2, 2, 160) | - | - | - |
| Conv2d, 256 | (batch, 2, 2, 256) | - | - | - |
| Dropout, 0.3 | (batch, 2, 2, 256) | - | - | - |
| GAVPool, Reshape | (batch, 1, 1, 256) | - | - | - |
| Conv2d, 512 | (batch, 1, 1, 512) | - | - | - |
| Dense, Softmax | (batch, 7) | - | - | - |
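To make the modification concrete, the following is a minimal TensorFlow/Keras sketch of the idea behind the improved Ghost bottleneck: the stride = 1 path is given an extra downsampling convolution so that its output can be merged with the stride = 2 path. The `ghost_module` below is a simplified stand-in for the original Ghost module [25]; the SE modules, shortcut connections, and exact layer settings of Table I are omitted, and the 48×48×1 input size is an assumption for illustration only.

```python
# Minimal sketch (TensorFlow/Keras) of the improved Ghost bottleneck idea:
# the stride-1 path is downsampled with an extra convolution so that it can
# be merged with the stride-2 path. Layer settings are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def ghost_module(x, out_channels, ratio=2):
    """Simplified Ghost module: a few ordinary filters plus cheap depthwise ones."""
    primary_ch = out_channels // ratio
    primary = layers.Conv2D(primary_ch, 1, padding="same", activation="relu")(x)
    cheap = layers.DepthwiseConv2D(3, padding="same", activation="relu")(primary)
    return layers.Concatenate()([primary, cheap])

def improved_ghost_bottleneck(x, exp_size, out_channels):
    # Stride-2 path: expansion, depthwise downsampling, projection.
    y = ghost_module(x, exp_size)
    y = layers.DepthwiseConv2D(3, strides=2, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = ghost_module(y, out_channels)

    # Stride-1 path: expansion and projection as before, then an extra strided
    # convolution so its output shape matches the stride-2 path.
    z = ghost_module(x, exp_size)
    z = ghost_module(z, out_channels)
    z = layers.Conv2D(out_channels, 3, strides=2, padding="same")(z)
    z = layers.BatchNormalization()(z)

    # Merge the two scales (element-wise addition keeps the channel count).
    return layers.Add()([y, z])

inputs = tf.keras.Input(shape=(48, 48, 1))          # assumed input size
x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
x = improved_ghost_bottleneck(x, exp_size=120, out_channels=40)
x = layers.Dropout(0.3)(x)
```

Under these assumptions, both paths produce (batch, 12, 12, 40) feature maps that are merged by element-wise addition, which matches the second row of Table I.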
2). Architecture of LFCNN for SER
The overall structure of our proposed LFCNN is shown in Fig. 3. It is mainly composed of three parts: a parallel convolution structure, a residual structure, and a serial convolution structure. Depthwise separable convolution (DSC) has been shown in past research to require far fewer parameters than traditional convolution [26]. The success of Xception demonstrates the advantage of DSC over traditional convolution, and we therefore use it to design the LFCNN. All convolutional layers mentioned later in this section are DSC layers. Further details on the structure of LFCNN can be found in the Supplementary Materials accompanying this manuscript.
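Since the full LFCNN configuration is given only in the Supplementary Materials, the following is a minimal TensorFlow/Keras sketch of how the three parts could be assembled from DSC layers. The input shape, kernel sizes, channel counts, and layer ordering are assumptions for illustration, not the exact LFCNN settings.

```python
# Minimal sketch (TensorFlow/Keras) of a parallel + residual + serial stack
# built from depthwise separable convolutions (DSC). All hyperparameters
# below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def parallel_dsc_block(x, filters):
    """Run DSC layers with different kernel sizes in parallel and merge them."""
    branches = []
    for k in (3, 5, 7):                                  # assumed kernel sizes
        b = layers.SeparableConv2D(filters, k, padding="same", activation="relu")(x)
        branches.append(b)
    return layers.Concatenate()(branches)

inputs = tf.keras.Input(shape=(128, 128, 1))             # assumed feature-map size
x = parallel_dsc_block(inputs, filters=32)

# Residual stage built from DSC layers; a 1x1 convolution aligns the channel
# count so the shortcut can be added directly.
shortcut = layers.Conv2D(64, 1, padding="same")(x)
y = layers.SeparableConv2D(64, 3, padding="same", activation="relu")(x)
y = layers.SeparableConv2D(64, 3, padding="same")(y)
x = layers.ReLU()(layers.Add()([shortcut, y]))

# Serial DSC stage followed by global pooling and the classifier head.
x = layers.SeparableConv2D(128, 3, strides=2, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(7, activation="softmax")(x)        # 7 EMO-DB emotion classes
model = tf.keras.Model(inputs, outputs)
```

The design intent this sketch tries to convey is that every convolution is a DSC layer, which is what keeps the parameter count (and model size) small while preserving the parallel/residual/serial topology described above.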
3). Architecture of tLSTM for EER
Our proposed tLSTM structure is shown in Fig. 4. In the tree part, the LSTMs of the leaf nodes all have the same number of neurons to ensure that their output shapes are consistent, since their outputs are merged and fed into the sequence part. The proposed tree structure consists of four levels, each representing one stage of features. The leaf nodes are located at different levels and are used to fuse the features of each stage, yielding a more comprehensive representation. It is worth noting that the LSTMs in the tree part output the entire sequence, whereas the LSTM in the sequence part outputs only the last hidden state. Finally, the network outputs the arousal and valence scores separately through dense layers.
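As a rough illustration of this multi-stage fusion, the following TensorFlow/Keras sketch builds four levels, branches off a leaf LSTM at each level, merges the full leaf sequences, and passes them to a sequence LSTM that keeps only its last hidden state. The exact tree topology of Fig. 4, the layer widths, and the EEG input dimensions are assumptions, not the configuration used in the paper.

```python
# Minimal sketch (TensorFlow/Keras) of the tree-like LSTM idea: leaf LSTMs at
# different depths return full sequences with the same width so they can be
# merged and fed into a sequence LSTM that returns only its last hidden state.
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, N_FEATS, LEAF_UNITS = 62, 5, 32          # assumed EEG feature dimensions

inputs = tf.keras.Input(shape=(SEQ_LEN, N_FEATS))

# Tree part: each level refines the previous one; a leaf LSTM branches off at
# every level so that shallow and deep features are all preserved.
level = inputs
leaf_outputs = []
for _ in range(4):                                 # four levels, as described above
    level = layers.LSTM(LEAF_UNITS, return_sequences=True)(level)
    leaf = layers.LSTM(LEAF_UNITS, return_sequences=True)(level)
    leaf_outputs.append(leaf)

# Merge the multi-stage features along the channel axis.
merged = layers.Concatenate(axis=-1)(leaf_outputs)

# Sequence part: only the last hidden state is kept.
h = layers.LSTM(64)(merged)

# Separate regression heads for arousal and valence.
arousal = layers.Dense(1, name="arousal")(h)
valence = layers.Dense(1, name="valence")(h)
model = tf.keras.Model(inputs, [arousal, valence])
```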
C. Proposed Decision-Level Fusion Strategy
To determine the reliability of each modality, we developed an optimal weight distribution algorithm. Taking arousal score decision fusion as an example, assume that there are $n$ modalities corresponding to $n$ regression models and that a total of $T$ trials are used for prediction. The predicted average arousal score for trial $t$ given by the $m$-th model is $s_{m,t}$, where $m = 1, 2, \dots, n$ and $t = 1, 2, \dots, T$. Let the candidate weight set be $W = \{0.00, 0.01, 0.02, \dots, 0.98, 0.99, 1.00\}$, an array that starts at 0.00 and ends at 1.00 with a step size of 0.01. The root mean square error (RMSE) is used as the measure to evaluate the performance of the current weight distribution. Under the best-performing weight distribution, the RMSE should be the smallest, denoted as $E_{\min}$. Based on the above definitions, the steps of the optimal weight distribution algorithm are as follows:

Step 1: The weights of the $n$ modalities are enumerated over $W$. Let the weight of the $m$-th modality be $w_m$; then go to Step 2 when (1) is satisfied:

$$\sum_{m=1}^{n} w_m = 1. \tag{1}$$

The algorithm ends when the enumeration is finished, and the optimal weight distribution is saved.

Step 2: Calculate the predicted arousal score under the current weight distribution. Let the predicted arousal score of trial $t$ be $\hat{s}_t$; the calculation can be expressed as:

$$\hat{s}_t = \sum_{m=1}^{n} w_m s_{m,t}. \tag{2}$$

Step 3: The RMSE of the $T$ trials under the current weight distribution, denoted as $E$, is calculated as (3), where $y_t$ is the actual arousal score of trial $t$:

$$E = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(\hat{s}_t - y_t\right)^2}. \tag{3}$$

By comparing $E$ with $E_{\min}$: when $E < E_{\min}$, the current weight distribution is considered to perform better, so $E_{\min}$ is updated to $E$ and the current weight distribution is saved; when $E \ge E_{\min}$, the current weight distribution is not considered better. In either case, Step 1 is then performed again.
To provide a clear illustration of the algorithm's implementation, the flow chart of the algorithm execution as well as the pseudo-code for the case of fusion of three modes is presented in the Supplementary Materials of this manuscript.
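As a further concrete illustration of Steps 1-3 (the official pseudo-code is the one given in the Supplementary Materials), the following is a minimal NumPy sketch of the enumeration for the three-modality case. The function and variable names are ours, and the toy data at the end are synthetic, used only to show how the routine is called.

```python
# Sketch of the optimal weight distribution search (Steps 1-3) for three
# modalities. preds has shape (3, T): one row of predicted arousal scores per
# modality; truth has shape (T,) and holds the actual arousal scores.
import numpy as np

def search_weights(preds: np.ndarray, truth: np.ndarray, step: float = 0.01):
    n_modes, _ = preds.shape
    assert n_modes == 3, "this sketch enumerates exactly three modalities"
    grid = np.round(np.arange(0.0, 1.0 + step, step), 2)   # {0.00, 0.01, ..., 1.00}

    best_rmse, best_weights = np.inf, None
    for w1 in grid:                        # Step 1: enumerate weights summing to 1
        for w2 in grid:
            w3 = round(1.0 - w1 - w2, 2)
            if w3 < 0.0 or w3 > 1.0:
                continue
            weights = np.array([w1, w2, w3])
            fused = weights @ preds        # Step 2: weighted fusion of predictions
            rmse = np.sqrt(np.mean((fused - truth) ** 2))   # Step 3: RMSE
            if rmse < best_rmse:           # keep the best-performing distribution
                best_rmse, best_weights = rmse, weights
    return best_weights, best_rmse

# Toy usage with random numbers (illustration only, not real data).
rng = np.random.default_rng(0)
truth = rng.uniform(1, 9, size=20)
preds = truth + rng.normal(0.0, [[0.5], [1.0], [2.0]], size=(3, 20))
weights, rmse = search_weights(preds, truth)
print(weights, rmse)
```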
D. Experiment
The setup and various details of the experiment can be found in the Supplementary Materials of this manuscript.
III. Results and Discussion
A. Results on CK+
Tenfold cross-validation was conducted on CK+ using both GhostNet and our improved GhostNet. The confusion matrices are shown in Fig. 5(a) and (b). The results show that our improved GhostNet achieved an average accuracy of 98.27%, outperforming the original GhostNet (90.21%).

Furthermore, we found that the GhostNet model overfits during training. To address this, we introduced carefully placed dropout layers while modifying the Ghost bottleneck to alleviate overfitting. The accuracy and loss curves obtained during training of the GhostNet model on CK+ before and after the improvement, as a function of epoch, can be found in the Supplementary Materials of this manuscript. The improved GhostNet achieved an average accuracy of 98.27%, but the accuracy for the fear expression was only 95%, which may be because fear and contempt have similar features. Nevertheless, our proposed method achieves results that are competitive with recent research. Table II shows the comparison with some recent studies. It can be seen from the table that our improved GhostNet performs better than other classical classification models, which demonstrates the superiority of the proposed method.
TABLE II. Comparison With Recent Studies on the CK+ Dataset. Val: Validation Method, Acc: Accuracy.

| Literature | Model | Val | Acc (%) |
| --- | --- | --- | --- |
| Nasri et al. 2020 [27] | Xception | 10-fold | 98.20 |
| Chowdary et al. 2021 [28] | VGG19 | - | 96.00 |
| Chowdary et al. 2021 [28] | Inception-v3 | - | 94.20 |
| Priya et al. 2021 [29] | MobileNet | 10-fold | 96.00 |
| Mishra et al. 2022 [30] | ResNet50 | 5-fold | 89.80 |
| Shaik et al. 2022 [31] | CNN-Attention | 10-fold | 97.67 |
| Ours | GhostNet | 10-fold | 90.21 |
| Ours | Improved GhostNet | 10-fold | 98.27 |
B. Results on EMO-DB
For the experiments on EMO-DB, we achieve an average accuracy of 94.36% with an F1-score of 94.38%, which surpasses most recent studies. The size of our proposed model is only 2.28 MB, which is much smaller than other models and makes it more suitable for future mobile devices. The number of parameters for each component of the LFCNN structure can be found in the Supplementary Materials accompanying the manuscript. Table III compares our work with previous work in terms of prediction accuracy and model size; the comparison validates the superiority of our method.
TABLE III. Comparison With Recent Studies on the EMO-DB Dataset. Val: Validation Method, Acc: Accuracy.
C. Results on MAHNOB-HCI
The experimental results obtained on the MAHNOB-HCI dataset show that the tLSTM model achieves advanced results in EER and demonstrate the feasibility of the decision-level fusion method. Fig. 6 shows the verification results of subjects 1 to 15. Table IV shows the average RMSE of some subjects after fusion. We found that the Deep-Emotion fusion method proposed in this paper improves performance in both the arousal and valence dimensions. However, the fusion method is only significantly better than the speech modality (paired t-test); we did not find that the fusion method significantly outperformed the facial expression or EEG modalities (paired t-test). In emotion recognition experiments, facial expressions are often associated with high volatility, since subjects may deceive the machine by mimicking certain facial expressions. In this case, the gap between the error associated with facial expressions and the error of accurate emotion detection can be filled by adding information sources (e.g., EEG and speech). Furthermore, in the MAHNOB-HCI dataset, the subjects were asked to behave naturally rather than mimic certain facial expressions, which may be the main reason we could not find solid statistical evidence of a significant improvement after fusion. For example, in this study, the FER accuracies of subjects 2 and 13 are higher than the result of the fusion method, because the fusion result combines multiple modalities for comprehensive consideration.
TABLE IV. The Average RMSE Report of Some Subjects Under Our Proposed Decision-Level Fusion Method.
| Subject ID | 1 | 4 | 9 | 15 | 18 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Arousal, scale [1-9] | | | | | | |
| Arousal, scale [0-1] | | | | | | |
| Valence, scale [1-9] | | | | | | |
| Valence, scale [0-1] | | | | | | |
In addition to its relatively high accuracy, Deep-Emotion also exhibits relatively good robustness. For example, when subjects express facial expressions that differ from their real emotions, the results obtained by MER fusion will not deviate significantly from the ground truth, because the subjects' EEG still reflects their real emotion. A comparison of this study with other research on emotion recognition on the MAHNOB-HCI dataset can be found in the Supplementary Materials. Moreover, SER is a very challenging task on MAHNOB-HCI, because the speech signal in this dataset includes not only the voice of the subject but also the audio of the stimulus material, which makes it difficult to isolate the subject's voice. This may explain our relatively modest recognition rate in SER.
Currently, our work is primarily based on the analysis of open source datasets, and we have not conducted independent data collection to further verify our findings. In future research, we plan to design a standardized experimental paradigm to collect additional data from subjects to more thoroughly evaluate the capabilities of Deep-Emotion.
IV. Conclusion
In our work, we propose a new MER method named Deep-Emotion, which uses deep learning techniques to build emotion recognition models for facial expressions, speech, and EEG. An improved GhostNet is proposed for facial expressions, which effectively alleviates overfitting and dramatically improves the model's performance. An LFCNN model is developed for speech signals, which greatly reduces the model size while maintaining recognition accuracy. For EEG signals, a tLSTM model that can better learn the emotional characteristics of each stage is designed. Furthermore, we designed an optimal weight distribution search algorithm to determine the reliability of each modality and achieve decision-level fusion. Our proposed methods are tested on open-source datasets in MER experiments. To the best of our knowledge, this study is the first attempt to combine facial expressions, speech, and EEG for MER. The experimental results obtained on multiple public datasets validate the feasibility of the proposed method. In future work, we will further improve the fusion method so that it can dynamically assign a weight to each modality and thus enhance the overall robustness of the algorithm. Dynamic weight allocation assigns different weights to different modalities, or sources of information, in a multimodal learning system: modalities that carry more relevant and useful information receive higher weights, while less relevant or noisy modalities receive lower weights. This can improve performance by focusing on the most useful information and filtering out noise. Overall, we believe that dynamic weight allocation is a promising approach for multimodal learning, and we will explore its potential in future research.
Supplementary Materials
The Supplementary Materials of this manuscript include related work on emotion recognition. The data preprocessing is described, and the full structure of LFCNN is detailed. The flow chart of the algorithm execution as well as the pseudo-code of the optimal weight distribution algorithm are also provided. The experimental process of this study is described in detail. Finally, supplementary explanations of the experimental results are provided. This document can be accessed in the “Media” section of IEEE Xplore.
Funding Statement
This work was supported in part by STI 2030–Major Projects under Grant 2022ZD0208900 and in part by the National Natural Science Foundation of China under Grants 62076103 and 62271217.
Contributor Information
Jiahui Pan, Email: panjh82@qq.com.
Weijie Fang, Email: fangweijie01@gmail.com.
Zhihang Zhang, Email: f804006@163.com.
Bingzhi Chen, Email: chenbingzhi.smile@gmail.com.
Zheng Zhang, Email: zhengzhang@hit.edu.cn.
Shuihua Wang, Email: sw546@leicester.ac.uk.
References
- [1] Abdullah S. M. S. A., Ameen S. Y. A., Sadeeq M. A., and Zeebaree S., “Multimodal emotion recognition using deep learning,” J. Appl. Sci. Technol. Trends, vol. 2, no. 2, pp. 52–58, 2021.
- [2] Zhang J., Yin Z., Chen P., and Nichele S., “Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review,” Inf. Fusion, vol. 59, pp. 103–126, 2020.
- [3] Chen B., Cao Q., Hou M., Zhang Z., Lu G., and Zhang D., “Multimodal emotion recognition with temporal and semantic consistency,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, no. 10, pp. 3592–3603, 2021.
- [4] Wu M., Su W., Chen L., Pedrycz W., and Hirota K., “Two-stage fuzzy fusion based-convolution neural network for dynamic emotion recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 805–817, Apr.–Jun. 2022.
- [5] Meng W., Cai Y., Yang L. T., and Chiu W.-Y., “Hybrid emotion-aware monitoring system based on brainwaves for internet of medical things,” IEEE Internet Things J., vol. 8, no. 21, pp. 16014–16022, Nov. 2021.
- [6] Sukhavasi S. B., Sukhavasi S. B., Elleithy K., El-Sayed A., and Elleithy A., “A hybrid model for driver emotion detection using feature fusion approach,” Int. J. Environ. Res. Public Health, vol. 19, no. 5, 2022, Art. no. 3085.
- [7] Russell J. A., “Affective space is bipolar,” J. Pers. Social Psychol., vol. 37, no. 3, pp. 345–356, 1979.
- [8] Sajjad M. et al., “Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM,” IEEE Access, vol. 8, pp. 79861–79875, 2020.
- [9] Aftab A., Morsali A., Ghaemmaghami S., and Champagne B., “LIGHT-SERNET: A lightweight fully convolutional neural network for speech emotion recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 6912–6916.
- [10] Yao Z., Wang Z., Liu W., Liu Y., and Pan J., “Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN,” Speech Commun., vol. 120, pp. 11–19, 2020.
- [11] Li X. et al., “EEG based emotion recognition: A tutorial and review,” ACM Comput. Surv., vol. 55, pp. 1–57, 2022.
- [12] Mehrabian A., “Communication without words,” Psychol. Today, vol. 2, no. 4, pp. 53–56, 1968.
- [13] Middya A. I., Nag B., and Roy S., “Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities,” Knowl.-Based Syst., vol. 244, 2022, Art. no. 108580.
- [14] Zhou H., Du J., Zhang Y., Wang Q., Liu Q.-F., and Lee C.-H., “Information fusion in attention networks using adaptive and multi-level factorized bilinear pooling for audio-visual emotion recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, no. 6, pp. 2617–2629, 2021.
- [15] Ma Y., Hao Y., Chen M., Chen J., Lu P., and Košir A., “Audio-visual emotion fusion (AVEF): A deep efficient weighted approach,” Inf. Fusion, vol. 46, pp. 184–192, 2019.
- [16] Huang Y., Yang J., Liao P., and Pan J., “Fusion of facial expressions and EEG for multimodal emotion recognition,” Comput. Intell. Neurosci., vol. 2017, pp. 1–8, 2017.
- [17] Maithri M. et al., “Automated emotion recognition: Current trends and future perspectives,” Comput. Methods Programs Biomed., vol. 215, 2022, Art. no. 106646.
- [18] He Z. et al., “Advances in multimodal emotion recognition based on brain–computer interfaces,” Brain Sci., vol. 10, no. 10, 2020, Art. no. 687.
- [19] Fang Y., Rong R., and Huang J., “Hierarchical fusion of visual and physiological signals for emotion recognition,” Multidimensional Syst. Signal Process., vol. 32, no. 4, pp. 1103–1121, 2021.
- [20] Li R. et al., “MindLink-eumpy: An open-source python toolbox for multimodal emotion recognition,” Front. Hum. Neurosci., vol. 15, 2021, Art. no. 621493.
- [21] Tan Y., Sun Z., Duan F., Solé-Casals J., and Caiafa C. F., “A multimodal emotion recognition method based on facial expressions and electroencephalography,” Biomed. Signal Process. Control, vol. 70, 2021, Art. no. 103029.
- [22] Zhang K., Li Y., Wang J., Cambria E., and Li X., “Real-time video emotion recognition based on reinforcement learning and domain knowledge,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 3, pp. 1034–1047, Mar. 2022.
- [23] Liu D., Chen L., Wang Z., and Diao G., “Speech expression multimodal emotion recognition based on deep belief network,” J. Grid Comput., vol. 19, no. 2, pp. 1–13, 2021.
- [24] Li J., Li S., Pan J., and Wang F., “Cross-subject EEG emotion recognition with self-organized graph neural network,” Front. Neurosci., vol. 15, 2021, Art. no. 611653.
- [25] Han K., Wang Y., Tian Q., Guo J., Xu C., and Xu C., “GhostNet: More features from cheap operations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1580–1589.
- [26] Chollet F., “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1251–1258.
- [27] Nasri M., Hmani M. A., Mtibaa A., Petrovska-Delacretaz D., Slima M. B., and Hamida A. B., “Face emotion recognition from static image based on convolution neural networks,” in Proc. IEEE 5th Int. Conf. Adv. Technol. Signal Image Process., 2020, pp. 1–6.
- [28] Chowdary M. K., Nguyen T. N., and Hemanth D. J., “Deep learning-based facial emotion recognition for human–computer interaction applications,” Neural Comput. Appl., vol. 33, pp. 1–18, 2021.
- [29] Priya R. B., Hanmandlu M., and Vasikarla S., “Emotion recognition using deep learning,” in Proc. IEEE Appl. Imagery Pattern Recognit. Workshop, 2021, pp. 1–5.
- [30] Mishra S., Joshi B., Paudyal R., Chaulagain D., and Shakya S., “Deep residual learning for facial emotion recognition,” in Proc. Mobile Comput. Sustain. Inform., 2022, pp. 301–313.
- [31] Shaik N. S. and Cherukuri T. K., “Visual attention based composite dense neural network for facial expression recognition,” J. Ambient Intell. Humanized Comput., vol. 13, pp. 1–14, 2022.
- [32] Chen M., He X., Yang J., and Zhang H., “3-D convolutional recurrent neural networks with attention model for speech emotion recognition,” IEEE Signal Process. Lett., vol. 25, no. 10, pp. 1440–1444, Oct. 2018.
- [33] Muppidi A. and Radfar M., “Speech emotion recognition using quaternion convolutional neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2021, pp. 6309–6313.
- [34] Kwon S. et al., “ATT-Net: Enhanced emotion recognition system using lightweight self-attention module,” Appl. Soft Comput., vol. 102, 2021, Art. no. 107101.
- [35] Andayani F., Theng L. B., Tsun M. T., and Chua C., “Hybrid LSTM-Transformer model for emotion recognition from speech audio files,” IEEE Access, vol. 10, pp. 36018–36027, 2022.