Scientific Reports. 2026 Feb 5;16:7323. doi: 10.1038/s41598-026-38734-x

Facial expression recognition via variational inference

Gang Lv1, JunLing Zhang1, Chiki Tsoi2
PMCID: PMC12923884  PMID: 41639187

Abstract

Facial expressions in the wild are rarely discrete; they often manifest as compound emotions or subtle variations that challenge the discriminative capabilities of conventional models. While psychological research suggests that expressions are often combinations of basic emotional units, most existing FER methods rely on deterministic point estimation, failing to model the intrinsic uncertainty and continuous nature of emotions. To address this, we propose POSTER-Var, a framework integrating a Variational Inference-based Classification Head (VICH). Unlike standard classifiers, VICH maps facial features into a probabilistic latent space via the reparameterization trick, enabling the model to learn the underlying distribution of expression intensities. Furthermore, we enhance feature representation by introducing layer embeddings and nonlinear transformations into the feature pyramid, facilitating the fusion of hierarchical semantic information. Extensive experiments on RAF-DB, AffectNet, and FER+ demonstrate that our method effectively handles fine-grained expression recognition, achieving state-of-the-art performance. The code has been open-sourced at: https://github.com/lg2578/poster-var.

Keywords: Facial expression recognition, Variational inference, Probabilistic model, Feature representation

Subject terms: Computational biology and bioinformatics, Mathematics and computing

Introduction

Facial expressions are the manifestation of emotions on the face and are the primary form of emotional expression. Facial expression recognition (FER) holds vast research potential and application worth in human-computer interaction, psychology, intelligent robotics, intelligent surveillance, virtual reality and synthetic animation.

In recent years, with the continuous development of deep learning, facial expression recognition has achieved remarkable research progress17. However, the existing FER literature predominantly discretizes and orthogonalizes emotional states. By relying on deterministic point estimation for coarse classification, these methods fail to capture the high-dimensional and continuous spectrum of human emotion. FACS8 decomposes facial expressions into combinations of multiple action units (AUs); each AU corresponds to the movement of a specific facial muscle or group of muscles, and the same AU may occur across different expressions. Psychological studies9 and previous FER work10,11 have also shown that most emotions occur as combinations, mixtures, or compounds of the basic emotions, and multiple emotions with different intensities often coexist within a single facial image, especially in the real world, as shown in Fig. 1. Calibrating the feature distribution within a single image before making the final decision is therefore crucial for improving recognition accuracy. Salient feature suppression12 encourages the model to focus on weaker features by suppressing dominant ones. LDL13 introduces a simple but efficient label distribution learning method as a novel training strategy and leverages depthwise convolution to capture local and global-salient facial features.

Fig. 1. Mixed features that map to different expression classes coexist in a facial image. Thicker connecting lines represent higher predicted probabilities for the corresponding class. Class activation maps (CAMs) are generated using Grad-CAM14; the heatmaps show which regions of the image contribute positively to a specific class, even if that class is not the model's final prediction.

Inspired by the variational autoencoder (VAE) widely used in generative models, we propose a novel method that enables the model to better balance features corresponding to different expression classes. During training, the model performs reparameterization via the proposed Variational Inference-based Classification Head (VICH) to learn the underlying probabilistic distribution of expression combinations. Heatmap visualizations demonstrate that the model makes decisions by considering broader regional features.

Variational Inference (VI)15 offers a principled framework for incorporating uncertainty into deep models. It is an approximation technique for Bayesian inference that transforms the problem of computing the intractable posterior distribution into an optimization task by approximating it with a simpler, tractable distribution. While VI has shown great success in generative modeling16, its application to classification tasks remains limited. Previous methods typically decode the latent vector before feeding it into the classifier; for a pure classification task, this decoding step is redundant and compromises the model's performance. Moreover, during inference, using the mean of the learned Gaussian distribution helps reduce the intrinsic variability of the features. We therefore introduce two improvements to the reparameterization process. First, sampling is applied only during training, while the learned distribution mean is output directly during inference. Second, the final fully connected classifier is removed, allowing the reparameterized output to serve directly as the prediction. Furthermore, we enhance multi-scale feature fusion by incorporating layer embeddings and nonlinear transformations into the baseline fusion module. The layer embedding encodes the positional and semantic level of each feature map within the feature pyramid, allowing the model to better distinguish and integrate information from different scales. The nonlinear transformation enriches the representation capability of the fused features, facilitating more effective learning of complex patterns.
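As a concrete illustration of the behavior described above — sampling only during training, and emitting the learned mean directly at inference — the following NumPy sketch shows the reparameterization step in isolation. The 7-dimensional vectors and their values are illustrative, not taken from the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, training=True):
    """z = mu + sigma * eps during training; the deterministic mean at inference."""
    if not training:
        return mu
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal(std.shape)  # standard-normal noise
    return mu + std * eps

# Hypothetical 7-dimensional class-intensity parameters for one image.
mu = np.array([0.1, 2.3, 0.4, 0.2, 0.0, 1.8, 0.3])  # illustrative means
logvar = np.full(7, -2.0)                            # small variances

z_train = reparameterize(mu, logvar, training=True)   # stochastic sample
z_eval = reparameterize(mu, logvar, training=False)   # equals mu exactly
```

Because `std * eps` is a differentiable function of the learned parameters, gradients flow through the sampling step during training, while inference stays deterministic.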

Overall, our contributions are summarized as follows:

  • We propose a novel Variational Inference-based Classification Head (VICH). VICH is designed to learn the underlying distribution of expression combinations, thereby encouraging the model to calibrate the feature distribution and to make decisions based on broader regional features.

  • We enhance multi-stage feature fusion by incorporating layer embeddings and nonlinear transformations, which effectively harmonizes the semantic gaps between different levels and adaptively extracts task-relevant high-level abstractions within the feature pyramid.

  • Our method outperforms current SOTA approaches across multiple Facial Expression Recognition (FER) benchmarks, achieving accuracies of 92.76% on RAF-DB, 67.91% on AffectNet (7 classes), 64.27% on AffectNet (8 classes), and 91.89% on FER+.

Related work

Facial expression recognition

With the continuous advancement of deep learning, significant progress has been made in facial expression recognition. MHCNN3 uses multi-task learning to automatically crop edge-free faces and recognize facial expression, age, and gender. TransFER4 combines multi-attention dropping and multi-head self-attention dropping mechanisms to learn rich relation-aware local representations. MTSD-CF17 uses a multi-task self-distillation method with coarse- and fine-grained labels, providing additional guidance for the extraction of discriminative features. QCS1 uses cross-similarity attention and quadruplet cross similarity to adaptively mine discriminative features within the same class while separating interfering features across different classes. ArcFace2 introduces an additive angular margin loss to further improve the discriminative power of face recognition models and to stabilise training. POSTER5 combines a pre-trained facial landmark detector7 with an image feature extractor2 through a two-stream pyramidal cross-fusion transformer. POSTER++6 removes the image-to-landmark branch from the original two-stream design of POSTER and performs multi-scale feature extraction directly from the image backbone and the facial landmark detector; it significantly reduces model parameters and computational cost while slightly improving performance.

In summary, the aforementioned FER studies predominantly adopt deterministic point estimation approaches. However, these methods often struggle with the inherent ambiguity of facial expressions and the label noise present in large-scale datasets. By reducing a complex emotional state to a single hard label, deterministic models fail to capture the subtle transitions between different emotions and are sensitive to subjective annotation biases, which limits their robustness in real-world scenarios.

Variational inference-based classification network

In machine learning, parameter estimation methods are generally categorized into point estimation and Bayesian inference. The former yields a single optimal parameter value, while the latter models parameters as probability distributions to capture uncertainty15. VI can be viewed as an approximate form of Bayesian inference, where the intractable posterior is replaced by a parameterized distribution.
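When the approximating distribution is a diagonal Gaussian and the prior is standard normal — the usual VAE setting — the KL divergence that VI minimizes has a well-known closed form, KL = −½ Σ(1 + log σ² − μ² − σ²). A small NumPy helper makes this concrete:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# When the approximate posterior already equals the prior, the divergence is zero.
kl_zero = kl_to_standard_normal(np.zeros(7), np.zeros(7))   # -> 0.0
# Shifting the mean away from zero makes the divergence positive.
kl_shift = kl_to_standard_normal(np.ones(7), np.zeros(7))   # -> 3.5
```

This term acts as the regularizer in VI-based training, pulling the learned latent distribution toward the tractable prior.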

Given the great success of VI in generative tasks, several studies have also applied it to classification. AEVB35 uses an improved reparameterization technique that yields better performance of variational inference in classification tasks. AAE18 is a framework for speech emotion recognition that employs variational inference of latent variables and reconstruction of the speech signal. The VAE-based classifier19 removes the decoder and directly connects the latent variables to a classifier, jointly optimizing the encoder and the classifier with end-to-end training. FRA20 is a face representation augmentation method that manipulates the face embeddings produced by any face representation learning algorithm to create new embeddings representing the same identity and facial emotion but with an altered posture.

The architectural designs of these VI-based approaches provide valuable insights for improving our POSTER-Var model. By eliminating both the decoder and the final fully connected (FC) classifier used in conventional VI-based classification models, we introduce a novel classification head that substantially improves model performance and streamlines the overall architecture.

Attention mechanism

In deep learning, attention mechanisms often introduce element-wise multiplication as a core operation, allowing neural networks to dynamically emphasize or suppress different parts of the learned representation. For instance, in the Squeeze-and-Excitation block21, the output of the excitation module is multiplied with the original feature map to reweight channels according to their relative importance. Similarly, CBAM22 applies both channel and spatial attention maps via multiplicative scaling, enabling the model to focus on salient information from multiple perspectives. ViT23 treats an input image as a sequence of fixed-size patches and uses dot-product self-attention to compute weighted outputs. Micro_NesT24 uses a shallow feature extraction module and a hierarchical attention extraction module, enabling information interaction between different patches through aggregation modules. MFD25,26 integrates features across the whole training set via memory-attention layers, encouraging heterogeneous features with the same identity to exhibit higher similarity.

Taken together, fusing multiple attention mechanisms allows the model to capture multi-scale and multi-dimensional features, enhancing representational capacity and generalization. In our proposed method, four different attention mechanisms are effectively integrated to enhance model performance.

Method

Baseline

We adopt POSTER++ as the baseline, as it significantly reduces model parameters and computational cost while achieving slightly better performance than POSTER. POSTER++ employs IR502 as the image backbone to extract image features at three different scales, while MobileFaceNet7 is used to obtain landmark features at the corresponding scales.

Let the input image be $X \in \mathbb{R}^{3 \times h \times w}$, where 3 denotes the number of channels and $h$ and $w$ are the height and width of the image. In the baseline, the image features $X_{\mathrm{img}}$ and the landmark features $X_{\mathrm{lm}}$ are fused using global context window-based cross-attention27, and then concatenated along the channel dimension. The fused features $X_o$ are subsequently processed by a lightweight two-layer ViT to capture long-range dependencies, followed by a feed-forward network for classification.

Architecture

We propose POSTER-Var, which extends the baseline from two pivotal perspectives. First, we introduce a layer-embedding feature fusion module. Second, we design a classification head based on variational inference. Unlike previous studies that feed either the reconstructed output or the latent variables into a separate classifier, our method directly treats the reparameterized representations as the final classification outputs during training. As illustrated in Fig. 2, the components highlighted with bold lines represent the improvements introduced over the baseline model.

Fig. 2. Our proposed POSTER-Var architecture for FER.

A detailed explanation of the figure can be found in the following subsection. Compared with the baseline, the learnable layer positional embedding and the VICH module add only a negligible number of parameters; despite this minimal increase in model size and computational cost, they effectively improve the model's performance.

Attention-based multi-stage feature representation

In POSTER-Var, several attention mechanisms are employed. Features from the different feature extractors are first fused using global cross-attention:

$$X_{o,i} = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i \qquad (1)$$

Here, the subscript $i$ denotes different feature levels. $Q_i$ and $K_i$ are generated by applying linear projections with learnable weights to the image features $X_{\mathrm{img},i}$. In contrast to the standard self-attention mechanism, in our method $V_i$ is obtained by reshaping the landmark features $X_{\mathrm{lm},i}$ without applying a learnable linear projection:

$$V_i = \mathrm{reshape}\!\left(X_{\mathrm{lm},i}\right) \qquad (2)$$
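The asymmetric cross-attention described above — queries and keys projected from the image features, values reshaped from the landmark features without a learnable projection — can be sketched as follows. The token count, channel width, and the random stand-in weights `W_q` and `W_k` are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 49, 32                          # tokens and channels for one pyramid level (illustrative)
x_img = rng.standard_normal((n, d))    # image features at level i
x_lm = rng.standard_normal((n, d))     # landmark features at level i

# Q and K come from learnable projections of the image features;
# V is simply the reshaped landmark features (random stand-in weights here).
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
Q, K, V = x_img @ W_q, x_img @ W_k, x_lm

# Scaled dot-product attention with a numerically stable softmax.
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V                          # fused features at level i
```

Keeping the landmark branch projection-free means the attention weights, learned from the image stream, directly mix raw landmark information into the fused representation.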

In the second stage, the model adds the layer positional embedding vector to the input embeddings via broadcasting, to incorporate sequential positional information:

$$X'_i = \mathrm{Emb}_i\!\left(X_{o,i}\right) + P \qquad (3)$$

$P$ is the learnable layer positional embedding, and $\mathrm{Emb}_i$ is the corresponding embedding layer, which applies different convolution operations to normalize the feature maps of different levels. In the third stage, $X'$ is further processed by a 2-layer ViT to model global contextual relationships, yielding the representation vector $v$. In the fourth stage, $v$ is then refined via an enhanced Squeeze-and-Excitation (SE) module to adaptively recalibrate and enhance informative feature channels:

$$\hat{v} = v \odot \sigma\!\left(W_2\,\delta\!\left(W_1 v\right)\right) \qquad (4)$$

Here, $\odot$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function, and $\delta$ denotes the rectified linear unit (ReLU) activation function; $W_1$ and $W_2$ are the weight matrices of the two fully connected layers.
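The SE-style recalibration of Eq. (4) can be sketched in a few lines of NumPy; the channel count, reduction ratio, and random weights are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def se_recalibrate(v, W1, W2):
    """Channel reweighting: v * sigmoid(W2 @ relu(W1 @ v)), as in Eq. (4)."""
    gate = sigmoid(W2 @ relu(W1 @ v))   # per-channel gate in (0, 1)
    return v * gate

rng = np.random.default_rng(0)
c, r = 64, 16                           # channels and reduction ratio (illustrative)
v = rng.standard_normal(c)
W1 = rng.standard_normal((c // r, c)) / np.sqrt(c)       # squeeze FC
W2 = rng.standard_normal((c, c // r)) / np.sqrt(c // r)  # excite FC
out = se_recalibrate(v, W1, W2)
```

Because the sigmoid gate lies strictly in (0, 1), each channel can only be attenuated, never amplified — the module redistributes emphasis rather than injecting new magnitude.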

VI-based classifier

The VI module incorporates the reparameterization trick, a technique commonly employed in generative models to sample latent variables from a learned distribution. We repurpose this mechanism for classification, allowing probabilistic reasoning and uncertainty quantification in the decision process. During the training phase, the module samples from a Gaussian distribution parameterized by the predicted mean and log-variance, introducing stochasticity while preserving gradient flow through the sampling process.

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (5)$$

Here, $\epsilon$ denotes random noise sampled from the standard multivariate normal distribution, while $\mu$ and $\log\sigma^2$ are learnable vectors generated by the encoder network. $\mu$ represents the mean of the approximate posterior distribution $q(z \mid x)$, indicating the central location of the latent variable $z$ conditioned on the input $x$. $\sigma$ represents the standard deviation of this distribution, capturing the uncertainty or spread around the mean. These parameters define a diagonal Gaussian distribution in the latent space, from which $z$ is sampled using the reparameterization trick.

In the testing phase, to ensure stable and deterministic predictions, the module bypasses sampling and directly outputs the mean as the final latent representation for classification. This is the key difference between our method and previous VI-based classification approaches.
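The train/test asymmetry of VICH can be sketched as a small PyTorch module. The 512-dimensional input and the use of two parallel linear layers for the mean and log-variance are assumptions for illustration, not the paper's exact architecture; the essential point is that the reparameterized vector itself serves as the class-score vector, with no decoder and no trailing FC classifier:

```python
import torch
import torch.nn as nn

class VICH(nn.Module):
    """Sketch of a variational classification head (illustrative sizes)."""

    def __init__(self, in_dim=512, num_classes=7):
        super().__init__()
        self.mu = nn.Linear(in_dim, num_classes)       # predicts the posterior mean
        self.logvar = nn.Linear(in_dim, num_classes)   # predicts the log-variance

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        if self.training:  # sample only during training (reparameterization trick)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        else:              # deterministic mean at inference
            z = mu
        return z, mu, logvar

head = VICH()
feat = torch.randn(4, 512)   # a toy batch of fused features
head.eval()
with torch.no_grad():
    logits, mu, logvar = head(feat)
```

Calling `head.eval()` is all that is needed to switch to the deterministic path, since `nn.Module.training` drives the branch.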

Experiments

Datasets

We verify the effectiveness of POSTER-Var on several FER benchmarks: RAF-DB28, AffectNet29, and FER+30.

RAF-DB. The Real-world Affective Faces Database (RAF-DB)28, developed by Beijing University of Posts and Telecommunications, comprises approximately 30,000 facial images collected from thousands of individuals in unconstrained environments. In this study, we use the RAF-DB basic-emotion subset, a widely adopted benchmark of 15,339 real-world facial images, each annotated with one of seven basic emotion classes: Happy, Sad, Surprise, Anger, Disgust, Fear, and Neutral. To ensure annotation consistency and reliability, each image was labeled by approximately 40 independent raters, and the final label was derived using the Expectation-Maximization (EM) algorithm. Following the standard partition, the dataset is divided into 12,271 training images and 3,068 test images, making it well suited for training and evaluating facial expression recognition models.

AffectNet. AffectNet29, developed by the University of Denver, is currently the largest publicly available dataset in the field of FER, containing approximately 1 million facial images with emotion labels. The dataset primarily includes 8 classes of basic emotions: Neutral, Happy, Anger, Sadness, Fear, Surprise, Disgust, and Contempt. In addition to these annotated classes, AffectNet also includes three extra labels: None for faces that do not express any recognizable emotion, Uncertain for ambiguous expressions that annotators could not confidently classify, and No-face for images where no face was detected. To ensure the quality and reliability of model training, we use the 7-class version of AffectNet (excluding Contempt) and the 8-class version in this study. AffectNet (7 cls) consists of 283,902 training images and 3,500 validation images (500 per category). AffectNet (8 cls) consists of 287,652 training images and 4,000 validation images (500 per category).

FER+. FER+30, developed by Microsoft Research, is an enhanced version of the original FER2013 dataset; it contains 28,709 training, 3,589 validation, and 3,589 test images. In FER+, each image has been labeled by 10 crowd-sourced taggers, which provides better-quality ground truth for still-image emotion than the original FER labels. Having 10 taggers per image enables researchers to estimate an emotion probability distribution per face, allowing algorithms that produce statistical distributions or multi-label outputs instead of the conventional single-label output. Following1,30, we filtered out samples labeled 'no face' or 'unknown' and report the overall accuracy on the test set.
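The per-face emotion distribution that FER+'s 10 taggers enable is simply the normalized tag counts; the tags below are hypothetical, for illustration only:

```python
from collections import Counter

# Hypothetical tags from 10 crowd-sourced annotators for one FER+ image.
tags = ["happy"] * 6 + ["neutral"] * 3 + ["surprise"]

counts = Counter(tags)
distribution = {emo: n / len(tags) for emo, n in counts.items()}
# e.g. {'happy': 0.6, 'neutral': 0.3, 'surprise': 0.1}
```

A distribution like this can be used directly as a soft target, which is precisely what makes FER+ suitable for multi-label or distribution-learning objectives.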

Experiment details

Training is conducted using the AdamW optimizer31 to ensure robust generalization and stable convergence. Beyond standard data augmentations such as random horizontal flipping and random erasing, the optimization on RAF-DB, AffectNet, and FER+ is supervised by a joint loss function that combines Cross-Entropy (CE) and Kullback-Leibler (KL) divergence. All experiments were conducted on a single NVIDIA RTX 3090 using PyTorch 2.5. To ensure the comparability of results, all methods were trained under identical conditions. The detailed training configurations and hyperparameters, including the epoch budget per dataset, are provided in Table 1.
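A sketch of the joint CE + KL objective described above: the KL term uses the standard closed form against an N(0, I) prior, and the weight `kl_weight` is an illustrative placeholder, not the paper's reported value:

```python
import torch
import torch.nn.functional as F

def vich_loss(z, mu, logvar, targets, kl_weight=0.1):
    """Cross-entropy on the reparameterized class scores plus a KL regularizer
    pulling the latent posterior toward N(0, I). kl_weight is illustrative."""
    ce = F.cross_entropy(z, targets)
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return ce + kl_weight * kl

# Toy batch: 4 samples, 7 expression classes.
z = torch.randn(4, 7)
mu = torch.zeros(4, 7)
logvar = torch.zeros(4, 7)          # posterior == prior, so the KL term vanishes
targets = torch.tensor([0, 3, 5, 6])
loss = vich_loss(z, mu, logvar, targets)
```

With the posterior equal to the prior, the loss reduces exactly to the cross-entropy term, which makes the regularizer's contribution easy to verify in isolation.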

Table 1.

Training configurations.

Configs | RAF-DB | AffectNet | FER+
Optimizer | AdamW | AdamW | AdamW
Init LR | 9e-6 | 2e-5 | 3e-5
Weight Decay | 1e-4 | 1e-4 | 1e-4
Batch Size | 48 | 48 | 48
Max Epochs | 250 | 200 | 200
LR Schedule | Exp. | Exp. | Exp.
Augmentation | Resize, H. Flip, Color Jitter (0.2), Normalize, Random Erasing on all datasets; Rot. and Random Crop on two of the three
Classes | 7 | 7/8 | 8
Loss Function | CE + weighted KL | CE + weighted KL | CE + weighted KL

Table 2 presents the performance comparison between our method and recent advanced approaches in the field of emotion recognition. Overall, emotion recognition techniques demonstrate continuous performance improvement across multiple benchmark datasets. POSTER-Var achieves state-of-the-art (SOTA) performance across several benchmarks, with accuracies of 92.76% on RAF-DB, 67.91% on AffectNet (7 classes), and 91.89% on FER+. These results consistently surpass the leading DCS method, which achieves 92.57%, 67.66%, and 91.41% respectively. The model also achieves a competitive 64.27% accuracy on the 8-class AffectNet, aligning with top-tier SOTA results. These results underscore the model’s exceptional capability in characterizing complex facial expressions. Such gains are primarily attributed to our probabilistic modeling of expression variation, which empowers the framework to effectively capture nuanced, subject-specific differences.

Table 2.

Comparison with SOTA methods.

Methods Year RAF-DB AffectNet (7 cls) AffectNet (8 cls) FER+
PSR32 CVPR 2020 88.98 63.77 60.68 89.75
EfficientFace13 AAAI 2021 88.36 63.70 60.23
Meta-Face2Exp33 CVPR 2022 88.54 64.23
POSTER5 ICCV 2023 92.05 67.31 63.34 91.62
MFER34 T-AFFC 2024 92.08 67.06 63.15 91.09
POSTER++6 PR 2025 92.21 67.49 63.77
DCS1 AAAI 2025 92.57 67.66 64.40 91.41
MTSD-CF17 ESWA 2025 92.63 66.26
Ours* 2026 92.76 67.91 64.27 91.89

*Detailed training logs and reproducibility results are available at: https://swanlab.cn/@lezi.

Bold values indicate the best performance.

Ablation study

To evaluate the effectiveness of the proposed layer embedding and VICH module, we conduct extensive ablation studies on three benchmark facial expression recognition datasets: RAF-DB, AffectNet (7 and 8 classes), and FER+. The results are summarized in Table 3. Inference time is calculated as the average of 1000 runs on a single NVIDIA 3090 GPU. The full POSTER-Var model achieves the best results across all datasets (RAF-DB: 92.76%, AffectNet 7 cls: 67.91%, AffectNet 8 cls: 64.27%, FER+: 91.89%) with negligible computational overhead, maintaining an inference time nearly identical to the baseline.

Table 3.

Ablation results of POSTER-Var.

Methods RAF-DB AffectNet (7 cls) AffectNet (8 cls) FER+ Inf. Time (ms)
Ours 92.76 67.91 64.27 91.89 1.502
w/o Layer Emb. 92.66 67.85 64.24 91.85 1.502
w/o VI Module 92.50 67.66 64.02 91.69 1.492
Baseline 92.21 67.49 63.77 91.62 1.491

Bold values indicate the best performance.

w/o Layer Emb. Removing the layer positional embedding leads to a consistent performance drop. On RAF-DB, accuracy decreases slightly to 92.66%. On AffectNet (7 cls) and (8 cls), accuracies drop to 67.85% and 64.24%, respectively. On FER+, accuracy decreases slightly to 91.85%. This suggests that the layer embedding helps improve the model's capacity to capture hierarchical feature representations.

w/o VI Module. Disabling the VICH module results in a more significant performance decline: RAF-DB drops to 92.50%, AffectNet (7 cls) and (8 cls) decline to 67.66% and 64.02%, and FER+ accuracy falls to 91.69%. This indicates that the VICH module plays a vital role in modeling uncertainty and enhancing generalization, especially on more complex datasets like AffectNet and FER+.

Both the layer embedding and VICH module are crucial to the success of POSTER-Var. Their removal consistently degrades performance, confirming their complementary contributions to improving expression recognition accuracy. Notably, the VICH module appears slightly more impactful, particularly in datasets with greater variation and class imbalance like AffectNet.

Visualization

We conducted a visual analysis comparing the baseline and POSTER-Var (ours) on RAF-DB. Figure 3 shows attention visualizations on facial images of different classes, including visualized facial landmarks and class activation maps. Both models focus on similar regions, indicating that both are able to learn the key features. However, the activation regions produced by POSTER-Var are more extensive and better aligned with key facial landmarks than those of the baseline. This broader attention helps the model capture the uncertainty of facial expressions and make decisions based on more comprehensive regional features, reducing the likelihood of misclassification.

Fig. 3. Attention visualization on facial images of different classes. Recognisable faces in the figure have been replaced by their dataset indices to comply with privacy policies; label #xxxx denotes the image indexed xxxx in the RAF-DB test set.

The more detailed experimental results of POSTER-Var on RAF-DB are presented in Table 4 and Fig. 4. The class distributions in the training and test sets of RAF-DB are relatively consistent, and the classification performance of individual classes tends to correlate with the number of training samples. Nevertheless, our model still achieves satisfactory precision for classes with fewer samples, such as fear, anger, and disgust.

Table 4.

Sample distribution and performance per expression Class.

Surprise Anger Sad Neutral Fear Happy Disgust
Training samples 1290 705 1982 2524 281 4772 717
Testing samples 329 162 478 680 74 1185 160
Recall 91.79% 86.42% 92.68% 93.53% 70.27% 96.79% 78.75%
Precision 92.35% 90.91% 89.68% 90.47% 85.25% 97.20% 84.56%

Fig. 4. Confusion matrix of our method on RAF-DB.

From Fig. 4, we observe that the neutral class (label = 3) exhibits a significantly higher false positive rate than the happy class (label = 5). The neutral class has 70 false positives, far exceeding the 38 of the happy class, resulting in a considerably higher false positive rate (9.92% vs. 3.21%). This suggests that the model is more prone to misclassifying other emotions as neutral. Notably, the neutral class contains only about half as many training samples as the happy class, indicating that this tendency is not driven by class imbalance.
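The quoted rates can be reproduced from the counts implied by Table 4. The "false positive rate" used here is FP / (TP + FP), i.e. one minus precision; the TP counts below are rounded from the reported recall values and test-set sizes:

```python
# Counts derived from Table 4 / Fig. 4 (TP rounded from the reported recalls).
neutral_fp, neutral_tp = 70, 636   # TP ≈ 0.9353 * 680 neutral test samples
happy_fp, happy_tp = 38, 1147      # TP ≈ 0.9679 * 1185 happy test samples

neutral_rate = neutral_fp / (neutral_tp + neutral_fp)
happy_rate = happy_fp / (happy_tp + happy_fp)
# neutral_rate ≈ 0.0992 and happy_rate ≈ 0.0321, matching the 9.92% vs. 3.21% in the text
```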

Benefiting from the ability of the VICH module to learn the underlying distribution of expression combinations, we can easily plot the expression feature distribution of a given image, as shown in Fig. 5. The x-axis represents the expression intensity predicted by the model, and the class with the highest intensity among the seven categories is taken as the final classification result. The baseline output (indicated by the points) incorrectly classifies the image as sad instead of neutral. In contrast, our model produces the correct classification. The reparameterization strategy employed during training encourages the model to evaluate images across a broader range of intensity values, strengthens the calibration of expression features, and enlarges inter-class discriminative distances.

Fig. 5. Normal distributions of seven emotions learned by VICH for a given image. Points and solid curves denote the outputs of the baseline and POSTER-Var, respectively. The final prediction is determined by the expression category with the highest intensity value. Recognisable faces in the figure have been replaced by their dataset indices to comply with privacy policies; label #xxxx denotes the image indexed xxxx in the RAF-DB test set.

Conclusions

In this paper, we addressed the limitation of deterministic point estimation in capturing the complexity of real-world facial expressions. By acknowledging that expressions are often combinations of basic emotions, we proposed POSTER-Var, incorporating a VI-based Classification Head. This approach fundamentally shifts the learning paradigm from fitting specific points to modeling feature distributions, thereby quantifying the uncertainty inherent in compound expressions. Coupled with our enhanced multi-scale feature fusion, the proposed method achieves superior performance on benchmark datasets. Our work suggests that probabilistic modeling is a promising direction for the next generation of fine-grained and robust Affective Computing systems. Future research will focus on integrating Domain Generalization (DG) frameworks with our variational architecture. Specifically, we aim to explore disentangled representation learning to effectively separate emotion-specific latent variables from identity-related nuisance factors. This will ensure that the learned feature distributions are more invariant across different datasets, ultimately facilitating the deployment of POSTER-Var in diverse, real-world human-computer interaction applications.

Author contributions

Gang Lv: Conceptualization, Methodology, Writing - original draft, Investigation, Software, Validation. Junling Zhang: Conceptualization, Writing - review and editing. Chiki Tsoi: Validation, and provided valuable guidance, particularly on improving the figures.

Funding

This work was supported by the Zhejiang Office Philosophy and Social Sciences Planning Project (24NDJC04Z), the 3rd Batch of Scientific Research Innovation Teams of Zhejiang Open University, and the Jinhua Science and Technology Bureau (2025-4-178). The funders had no role in the design of the study, collection and analysis of data, writing of the manuscript, or decision to submit the manuscript for publication.

Data availability

The RAF-DB dataset is available from the original authors upon request for non-commercial research purposes. Researchers affiliated with academic institutions may request access by contacting the authors as described at http://whdeng.cn/RAF/model1.html. The FER+ dataset is available at https://github.com/microsoft/FERPlus. The AffectNet dataset can be requested from the original authors at https://mohammadmahoor.com/pages/databases/affectnet/ by eligible researchers (e.g., Principal Investigators) subject to a signed license agreement.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Wang, C., Chen, L., Wang, L., Li, Z. & Lv, X. QCS: Feature refining from quadruplet cross similarity for facial expression recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, pp. 7563–7572 (2025).
  • 2. Deng, J., Guo, J., Xue, N. & Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. http://openaccess.thecvf.com/content_CVPR_2019/html/Deng_ArcFace_Additive_Angular_Margin_Loss_for_Deep_Face_Recognition_CVPR_2019_paper.html (2019).
  • 3. Savchenko, A. V. Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. In 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY), pp. 119–124. IEEE. https://ieeexplore.ieee.org/abstract/document/9582508/ (2021).
  • 4. Xue, F., Wang, Q. & Guo, G. TransFER: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3601–3610. http://openaccess.thecvf.com/content/ICCV2021/html/Xue_TransFER_Learning_Relation-Aware_Facial_Expression_Representations_With_Transformers_ICCV_2021_paper.html (2021).
  • 5. Zheng, C., Mendieta, M. & Chen, C. POSTER: A pyramid cross-fusion transformer network for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 3146–3155. https://openaccess.thecvf.com/content/ICCV2023W/AMFG/html/Zheng_POSTER_A_Pyramid_Cross-Fusion_Transformer_Network_for_Facial_Expression_Recognition_ICCVW_2023_paper.html (2023).
  • 6. Mao, J. et al. POSTER++: A simpler and stronger facial expression recognition network. Pattern Recognit. 157, 110951. 10.1016/j.patcog.2024.110951 (2025).
  • 7. Chen, C. PyTorch Face Landmark: A fast and accurate facial landmark detector. Open-source software available at https://github.com/cunjian/pytorch_face_landmark (2021).
  • 8. Ekman, P. & Friesen, W. V. Facial Action Coding System. Environmental Psychology & Nonverbal Behavior (1978).
  • 9. Plutchik, R. A general psychoevolutionary theory of emotion. In Theories of Emotion, pp. 3–33. Elsevier. https://www.sciencedirect.com/science/article/pii/B9780125587013500077 (1980).
  • 10. Zhou, Y., Xue, H. & Geng, X. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1247–1250. ACM, Brisbane, Australia. 10.1145/2733373.2806328 (2015).
  • 11. Jia, X., Zheng, X., Li, W., Zhang, C. & Li, Z. Facial emotion distribution learning by exploiting low-rank label correlations locally. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9841–9850. http://openaccess.thecvf.com/content_CVPR_2019/html/Jia_Facial_Emotion_Distribution_Learning_by_Exploiting_Low-Rank_Label_Correlations_Locally_CVPR_2019_paper.html (2019).
  • 12. Yang, S., Yang, X., Wu, J. & Feng, B. Significant feature suppression and cross-feature fusion networks for fine-grained visual classification. Sci. Rep. 14(1), 24051. 10.1038/s41598-024-74654-4 (2024).
  • 13. Zhao, Z., Liu, Q. & Zhou, F. Robust lightweight facial expression recognition network with label distribution training. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3510–3519. https://ojs.aaai.org/index.php/aaai/article/view/16465 (2021).
  • 14. Gildenblat, J. & contributors. PyTorch library for CAM methods. GitHub. https://github.com/jacobgil/pytorch-grad-cam (2021).
  • 15. Zhang, C., Bütepage, J., Kjellström, H. & Mandt, S. Advances in variational inference. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 2008–2026. 10.1109/TPAMI.2018.2889774 (2019).
  • 16. Van Den Oord, A. & Vinyals, O. Neural discrete representation learning. In Advances in Neural Information Processing Systems, vol. 30 (2017).
  • 17. Zhang, Z., Li, X., Guo, K. & Xu, X. Facial expression recognition based on multi-task self-distillation with coarse and fine grained labels. Expert Syst. Appl. 281, 127440. 10.1016/j.eswa.2025.127440 (2025).
  • 18. Parthasarathy, S., Rozgic, V., Sun, M. & Wang, C. Improving emotion classification through variational inference of latent variables. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7410–7414. IEEE. https://ieeexplore.ieee.org/abstract/document/8682823/ (2019).
  • 19. Chamain, L. D., Qi, S. & Ding, Z. End-to-end image classification and compression with variational autoencoders. IEEE Internet Things J. 9(21), 21916–21931. 10.1109/JIOT.2022.3182313 (2022).
  • 20. Hashemifar, S., Marefat, A., Hassannataj Joloudari, J. & Hassanpour, H. Enhancing face recognition with latent space data augmentation and facial posture reconstruction. Expert Syst. Appl. 238, 122266. 10.1016/j.eswa.2023.122266 (2024).
  • 21. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. http://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html (2018).
  • 22. Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. CBAM: Convolutional block attention module. In Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII, pp. 3–19. Springer, Berlin, Heidelberg. 10.1007/978-3-030-01234-2_1 (2018).
  • 23. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G. & Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (2021).
  • 24. He, J. et al. Micro_nest: Multi-scale attention enhanced micro-expression recognition framework. Expert Syst. Appl. 290, 128372. 10.1016/j.eswa.2025.128372 (2025).
  • 25. Lu, Z., Lin, R. & Hu, H. Tri-level modality-information disentanglement for visible-infrared person re-identification. IEEE Trans. Multimed. 26, 2700–2714 (2023).
  • 26. Lu, Z., Lin, R. & Hu, H. Disentangling modality and posture factors: Memory-attention and orthogonal decomposition for visible-infrared person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 36(3), 5494–5508 (2024).
  • 27. Hatamizadeh, A., Yin, H., Heinrich, G., Kautz, J. & Molchanov, P. Global context vision transformers. In International Conference on Machine Learning, pp. 12633–12646. PMLR. https://proceedings.mlr.press/v202/hatamizadeh23a.html (2023).
  • 28. Li, S., Deng, W. & Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2852–2861. http://openaccess.thecvf.com/content_cvpr_2017/html/Li_Reliable_Crowdsourcing_and_CVPR_2017_paper.html (2017).
  • 29. Mollahosseini, A., Hasani, B. & Mahoor, M. H. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10(1), 18–31 (2017).
  • 30. Barsoum, E., Zhang, C., Ferrer, C. C. & Zhang, Z. Training deep networks for facial expression recognition with crowd-sourced label distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283. ACM, Tokyo, Japan. 10.1145/2993148.2993165 (2016).
  • 31. Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR) (2019).
  • 32. Vo, T.-H., Lee, G.-S., Yang, H.-J. & Kim, S.-H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 8, 131988–132001. 10.1109/ACCESS.2020.3010018 (2020).
  • 33. Zeng, D., Lin, Z., Yan, X., Liu, Y., Wang, F. & Tang, B. Face2Exp: Combating data biases for facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20291–20300. https://openaccess.thecvf.com/content/CVPR2022/html/Zeng_Face2Exp_Combating_Data_Biases_for_Facial_Expression_Recognition_CVPR_2022_paper.html (2022).
  • 34. Xu, J., Li, Y., Yang, G., He, L. & Luo, K. Multiscale facial expression recognition based on dynamic global and static local attention. IEEE Trans. Affect. Comput. 10.1109/TAFFC.2024.3458464 (2024).
  • 35. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).


