Sensors. 2023 Feb 23;23(5):2455. doi: 10.3390/s23052455

Table 2.

Other classification methods.

Model Name | Dataset Used | Classification Method | Details
T5-3B [165] | SST (NLP) | Transformer and self-attention | The authors used transfer learning and self-attention to convert all text-based language problems into a text-to-text format. They compared pre-training objectives, architectures, unlabeled datasets, and transfer approaches in NLP. The classification accuracy on the SST dataset is 97.4%.
MT-DNN-SMART [166] | SST (NLP) | Transformer and smoothness-inducing regularization | The authors proposed smoothness-inducing regularization, built on transfer learning, to manage model complexity. At the same time, a new optimization method was proposed to prevent aggressive updating. The classification accuracy on the SST dataset is 97.5%. A schematic sketch of the regularizer is given after this table.
GRU [167] | CREMA-D (SER) | Self-supervised representation learning | The authors proposed a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. They demonstrated the potential of visual supervision for learning audio representations and achieved 55.01% SER accuracy on the CREMA-D dataset.
EmoAffectNet [168] | CREMA-D and AffectNet (FER) | CNN-LSTM | The authors proposed a flexible FER system combining a CNN and an LSTM. The system consists of a backbone model and several temporal models, and every component can be replaced by another model. The backbone model achieved an accuracy of 66.4% on the AffectNet dataset, and the overall model achieved an accuracy of 79% on the CREMA-D dataset. A schematic CNN-LSTM sketch is given after this table.
M2FNet [169] | IEMOCAP and MELD (multimodal) | Multi-task CNN and multi-head attention-based fusion | The multimodal fusion network proposed by the authors extracts emotional features from the visual, audio, and textual modalities. The feature extractor was trained with an adaptive margin-based triplet loss function. The model achieved 67.85% accuracy and a 66.71 weighted average F1 score on the MELD dataset, and 69.69% accuracy and a 69.86 weighted average F1 score on the IEMOCAP dataset.
CH Fusion [170] | IEMOCAP (multimodal) | RNN and feature fusion strategy | The authors used RNNs to extract unimodal features from the audio, video, and text modalities. These unimodal features were then fused through a fully connected layer to form trimodal features, from which feature vectors for sentiment classification were obtained. The model achieved an F1 score of 0.768 and an accuracy of 0.765 on the IEMOCAP dataset. A schematic fusion sketch is given after this table.
EmotionFlow-large [171] | MELD (multimodal) | BERT model and conditional random field (CRF) | The authors studied how emotions propagate in dialogue emotion recognition. They utilized an encoder-decoder structure to learn user-specific features, and then applied a conditional random field (CRF) to capture sequence information at the sentiment level. The weighted F1 score on the MELD dataset was 66.50. A schematic CRF sketch is given after this table.
FN2EN [172] | CK+ (FER) | DCNN | The authors proposed a two-stage training algorithm. In the first stage, high-level neuron responses were modeled with probability distribution functions based on a fine-tuned face network. In the second stage, label supervision was applied to improve discriminative ability. The model achieved 96.8% (eight emotions) and 98.6% (six emotions) accuracy on the CK+ dataset.
Multi-task EfficientNet-B2 [173] | AffectNet (FER) | MTCNN and Adam optimization | The authors analyzed the behavior of students in an e-learning environment. The facial features extracted by the model can be used to quickly predict student engagement, individual emotions, and group-level affect. The model can even perform real-time video processing on each student's mobile device without sending the video to a remote server or the teacher's PC. It achieved 63.03% (eight emotions) and 66.29% (seven emotions) accuracy on the AffectNet dataset.
EAC [174] | RAF-DB (FER) | CNN and class activation mapping (CAM) | The authors approached noisy-label FER from the perspective of feature learning and proposed Erasing Attention Consistency (EAC). EAC requires neither the noise rate nor label integration, and it generalizes well to noisy-label classification tasks with a large number of classes. The overall accuracy on the RAF-DB dataset was 90.35%. A schematic attention-consistency sketch is given after this table.
BiHDM [175] | SEED (EEG signal) | RNNs | The authors proposed a model that learns the differential information between the left and right hemispheres of the human brain to improve EEG emotion recognition. They employed four directed recurrent neural networks, along two spatial orientations, to traverse the electrode signals of the two separate brain regions, preserving their inherent spatial dependencies. The accuracy on the SEED dataset reached 74.35%.
MMLatch [176] | CMU-MOSEI (multimodal) | LSTM, RNNs, and Transformers | The neural architecture proposed by the authors captures top-down cross-modal interactions. A feedback mechanism applied during the forward pass is used in model training. The accuracy on the CMU-MOSEI dataset reached 82.4%.
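The smoothness-inducing regularization used by MT-DNN-SMART [166] can be illustrated with a short PyTorch sketch. This is a simplified, single-step variant under our own assumptions: the perturbation is drawn at random rather than found by the adversarial inner loop of the original method, and `model` is assumed to be any classifier mapping input embeddings to logits.

```python
import torch
import torch.nn.functional as F

def smoothness_regularizer(model, embeds, eps=1e-3, noise_scale=1e-5):
    """Smoothness-inducing term: perturb the input embeddings slightly and
    penalize divergence between clean and perturbed predictions."""
    with torch.no_grad():
        p = F.softmax(model(embeds), dim=-1)           # predictions on clean inputs
    noise = torch.randn_like(embeds) * noise_scale      # small random perturbation
    noise = eps * noise / (noise.norm(dim=-1, keepdim=True) + 1e-12)
    q = F.log_softmax(model(embeds + noise), dim=-1)    # predictions on perturbed inputs
    return F.kl_div(q, p, reduction="batchmean")        # KL(clean || perturbed)
```

During training, such a term would typically be added to the task loss with a small weight, e.g. `loss = ce_loss + lambda_reg * smoothness_regularizer(model, embeds)`, where `lambda_reg` is an assumed hyperparameter name.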
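A minimal sketch of the CNN-LSTM arrangement described for EmoAffectNet [168]: a frame-level backbone followed by a temporal model, with either part replaceable. The ResNet-18 backbone, hidden size, and seven-emotion head below are illustrative assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
from torchvision import models

class CnnLstmFER(nn.Module):
    """A CNN backbone extracts per-frame features; an LSTM aggregates them over time.
    Either component can be swapped for another backbone or temporal model."""
    def __init__(self, num_emotions=7, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)     # stand-in backbone; any frame encoder works
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                  # keep the pooled frame features
        self.backbone = backbone
        self.temporal = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_emotions)

    def forward(self, clips):                        # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (batch*time, feat_dim)
        feats = feats.view(b, t, -1)
        _, (h, _) = self.temporal(feats)
        return self.head(h[-1])                      # emotion logits per clip
```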
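The fusion strategy described for CH Fusion [170] (unimodal RNN features fused through a fully connected layer) can be sketched as follows. This is a flat concatenation-plus-linear fusion for illustration only; the GRU choice, dimensions, and class count are assumptions, and the hierarchical details of the original method are not reproduced.

```python
import torch
import torch.nn as nn

class TrimodalFusion(nn.Module):
    """GRUs encode each modality sequence; their last hidden states are
    concatenated and fused by a fully connected layer before classification."""
    def __init__(self, dims, hidden=128, num_classes=4):
        super().__init__()
        self.audio_rnn = nn.GRU(dims["audio"], hidden, batch_first=True)
        self.video_rnn = nn.GRU(dims["video"], hidden, batch_first=True)
        self.text_rnn = nn.GRU(dims["text"], hidden, batch_first=True)
        self.fuse = nn.Linear(3 * hidden, hidden)
        self.classify = nn.Linear(hidden, num_classes)

    def forward(self, audio, video, text):           # each: (batch, time, dim)
        _, ha = self.audio_rnn(audio)
        _, hv = self.video_rnn(video)
        _, ht = self.text_rnn(text)
        fused = torch.relu(self.fuse(torch.cat([ha[-1], hv[-1], ht[-1]], dim=-1)))
        return self.classify(fused)                  # sentiment/emotion logits
```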
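EmotionFlow [171] places a conditional random field over per-utterance emotion predictions. The sketch below shows only the CRF stage, assuming utterance features have already been produced by a pretrained encoder (not shown); it relies on the third-party pytorch-crf package, which is an assumption about tooling rather than the authors' implementation.

```python
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf (assumed tooling)

class DialogEmotionCRF(nn.Module):
    """Per-utterance emission scores are produced by a linear layer; a CRF
    layer models emotion transitions across the turns of a dialogue."""
    def __init__(self, utt_dim, num_emotions):
        super().__init__()
        self.emit = nn.Linear(utt_dim, num_emotions)
        self.crf = CRF(num_emotions, batch_first=True)

    def loss(self, utt_feats, labels, mask):
        # utt_feats: (batch, turns, utt_dim); mask: bool tensor of valid turns
        emissions = self.emit(utt_feats)
        return -self.crf(emissions, labels, mask=mask)   # negative log-likelihood

    def predict(self, utt_feats, mask):
        return self.crf.decode(self.emit(utt_feats), mask=mask)
```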
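The attention-consistency idea behind EAC [174] can be illustrated with class activation maps. The toy sketch below shows only a flip-consistency penalty between the CAM of an image and that of its horizontal flip; the erasing step and the full noisy-label handling of the original method are omitted, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cam_from_features(feature_maps, fc_weight, labels):
    """Class activation map: weight the final conv feature maps by the
    classifier weights of the target class."""
    w = fc_weight[labels]                              # (batch, channels)
    return torch.einsum("bc,bchw->bhw", w, feature_maps)

def flip_attention_consistency(feature_maps, flipped_feature_maps, fc_weight, labels):
    """Penalize disagreement between the CAM of an image and the
    re-flipped CAM of its horizontally flipped copy."""
    cam = cam_from_features(feature_maps, fc_weight, labels)
    cam_flip = cam_from_features(flipped_feature_maps, fc_weight, labels).flip(-1)
    return F.mse_loss(cam, cam_flip)
```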