Scientific Reports. 2026 Feb 27;16:11261. doi: 10.1038/s41598-026-41514-2

Benchmarking quantum kernels and modern vision models for compound facial expression recognition

Mangaras Yanu Florestiyanto 1,2, Herman Dwi Surjono 1, Handaru Jati 1
PMCID: PMC13049136  PMID: 41760860

Abstract

We present a unified, compute-accounted comparison of seven pipelines for compound facial-expression recognition on RAF-DB: two classical hybrids (ResNet50–SVM, VGGFace–SVM), two modern baselines (EfficientNetV2-S, ViT-B/16), and three quantum-enhanced hybrids (QCNN, QKNN, QSVM). Beyond top-1 accuracy, we report feature-extraction (FX), training, and per-sample classification time to expose accuracy–efficiency trade-offs. ViT-B/16 achieves the highest accuracy (63.13%) with very low FX (~ 32.84 s) but at the cost of longer training time; EfficientNetV2-S is competitive (60.9%) with a short training time but higher FX (~ 2056.92 s). Among quantum hybrids, QSVM offers the best accuracy (54.97%) at moderate FX (~ 61.6 s), QKNN yields the most deployment-friendly FX (~ 24.47 s; 36.02% accuracy), and QCNN is FX-minimal (~ 11.9 s) but accuracy-limited (35.69%). Confusions cluster along fear–surprise and sadness–disgust, suggesting AU-aware local attention, margin-shaping objectives, and fairness-oriented augmentation. Overall, QSVM is the accuracy-leading quantum option under moderate budgets, QKNN suits tight latency envelopes, and EfficientNet/ViT remain strong when compute is ample. The protocol, ablations, and statistical tests (McNemar, BCa CIs, Cliff’s δ) support reproducible, decision-oriented benchmarking. We do not claim near-term “quantum advantage”; instead we provide a compute-accounted benchmark and a feasibility-oriented analysis. Additional analyses include a strictly matched classical SVM baseline for quantum-kernel attribution, a lightweight kernel-shaping validation (fusion + regularisation), and a hardware-normalized cost discussion motivating simulator-based experiments at dataset scale.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-41514-2.

Keywords: Compound facial expressions, EfficientNetV2-S, Vision transformer, Quantum machine learning, RAF-DB

Subject terms: Engineering, Mathematics and computing

Introduction

Facial-expression recognition (FER) is moving from laboratory settings to decision contexts that demand both reliability and speed—clinical triage, safety monitoring, social robotics, and conversational interfaces1–5. In those environments, compound expressions (e.g., fearfully surprised, sadly disgusted, angrily surprised) appear more often than textbook basic emotions. They originate from the co-activation of multiple affective components and rarely align with crisp boundaries6. Affective psychology offers principled explanations for this complexity: Russell’s Circumplex Model situates emotions along valence–arousal axes rather than as isolated labels, while Scherer’s Component Process Model treats emotion as the output of dynamic appraisal processes7,8. In this view, compound expressions occupy intermediate or transitional regions of affective space, producing partially overlapping facial action units (AUs) that even state-of-the-art models struggle to interpret. Beyond representational ambiguity, computational efficiency is a practical constraint: many FER deployments operate under tight latency and power budgets (e.g., on-device inference), where training dynamics, feature-extraction time, and end-to-end throughput matter as much as raw accuracy9–11.

The Real-world Affective Faces Database (RAF-DB) is widely used for both basic and compound categories and intentionally captures in-the-wild variability (pose, illumination, occlusion, demographic diversity), increasing ecological validity while exacerbating class imbalance and label ambiguity12. In such conditions, models often confuse pairs with shared action unit (AU) patterns—e.g., fear–surprise (wide eyes, raised brows) or sadness–disgust (downturned lip corners, nasolabial changes)—and conventional metrics may conceal semantically important error structure. These realities motivate an evaluation that balances predictive quality with computational cost, while explicitly analysing where and why models fail.

On the computational side, the field has coalesced around two modern baselines. First, convolutional neural networks (CNNs) remain highly competitive. Classic backbones (e.g., VGGFace, ResNet-50) deliver strong hierarchical features but can be heavy at inference. Newer Efficient CNNs (e.g., EfficientNet-V2) use compound scaling of depth/width/resolution, inverted bottlenecks, and squeeze-and-excitation to improve accuracy-per-FLOP and memory footprint13,14. Second, Vision Transformers (ViT) replace local convolution with global self-attention over image patches, capturing long-range dependencies that are attractive for spatially dispersed facial cues in compound expressions15. Yet, ViTs can be data- and compute-hungry, and even efficient CNNs incur nontrivial feature-extraction latency under strict deployment constraints.

A complementary line of work explores quantum-enhanced models to reshape the efficiency–accuracy frontier. Quantum SVMs (QSVMs) leverage quantum feature maps and kernels in high-dimensional Hilbert spaces to increase class separability at training/inference costs that can be favourable under certain regimes16. Quantum CNNs (QCNNs) substitute or augment convolution/pooling with parameterised quantum circuits to perform parallelised feature transformations17. Quantum k-NN (QKNN) accelerates similarity search and neighbour selection using amplitude encoding, swap-test similarity estimation, and Grover-style search (with amplitude estimation/amplification), potentially reducing time complexity for high-dimensional comparison18,19. While near-term quantum devices remain resource-limited, hybrid (quantum–classical) pipelines can already be benchmarked for their feature-extraction time, training stability, and accuracy on realistic datasets such as RAF-DB.

Against this background, we select seven representative approaches to probe how architectural choices trade off computational efficiency and predictive performance for compound FER:

  • ResNet50–SVM and VGGFace–SVM (classical hybrids). These pipelines pair strong convolutional features with margin-based classification, offering transparent baselines for accuracy, feature-extraction time, and generalisation under limited compute. They probe how far classical features, combined with a light classifier, can go with ambiguous, overlapping classes.

  • EfficientNet (EfficientNetV2-S) and Vision Transformer (ViT) (modern baselines). EfficientNet represents the modern CNN family optimised for accuracy-per-compute via compound scaling; ViT represents attention-based modelling of long-range facial dependencies. Together, they quantify the frontier of purely classical deep architectures under realistic latency and memory budgets.

  • Hybrid QCNN, QKNN, and QSVM (quantum hybrids). These probe whether quantum-assisted feature mappings and search can compress computation (especially feature-extraction time) while preserving or improving separation margins in compound classes—particularly among the known hard pairs (fear–surprise; sadness–disgust).

Methodologically, we standardise data splits, preprocessing, and evaluation on RAF-DB; control training schedules and input resolutions per model family; and account for compute via feature-extraction time, training time, and classification latency. Beyond top-1 accuracy, we analyse confusion matrices to surface semantically meaningful error patterns, and we report statistical significance where applicable to avoid over-interpreting small gaps. This protocol is designed to make cross-model comparisons fair, reproducible, and decision-relevant for practitioners facing deployment constraints.

This paper makes three contributions:

  1. Unified evaluation protocol with compute accounting. We present a consistent pipeline for RAF-DB compound FER across seven architectures—classical hybrids, modern CNN/ViT, and quantum hybrids—with matched preprocessing, controlled hyperparameters, and explicit reporting of feature-extraction time, training time, and classification latency. This allows accuracy to be interpreted alongside realistic computational budgets.

  2. Comprehensive comparative results. We show that quantum-assisted models reduce feature-extraction cost relative to deep baselines, with QSVM attaining the best overall accuracy among the quantum family and QKNN offering the most favourable accuracy–latency balance for real-time scenarios. EfficientNet and ViT remain strong modern baselines but require higher compute, making it clear when hybrids are preferable.

  3. Error structure and statistical validity. Through confusion-matrix analysis, we identify persistent confusions in fear–surprise and sadness–disgust blends shared across models, and we report statistical tests/effect sizes to contextualise observed differences. We discuss model-specific failure modes and derive targeted remedies (e.g., AU-aware local features for ViT, quantum kernel shaping for QSVM, class-balanced augmentation).

Taken together, our study positions compound FER as a joint problem of representational adequacy and computational efficiency. By triangulating classical, modern, and quantum-hybrid approaches under a unified protocol, we provide a decision-oriented map of trade-offs that can guide both academic benchmarking and practical deployment in affective computing.

Related work

Classical and modern FER. Early FER pipelines combined hand-crafted descriptors (e.g., LBP, HOG, SIFT) with margin-based or instance-based classifiers (SVM/KNN), trading representational power for speed and interpretability20,21. Deep CNNs overturned that trade-off: VGGFace and ResNet families learned hierarchical cues (AUs, texture, shape) that generalise across pose and illumination, but at non-trivial compute and memory cost22,23. More recent “modern CNNs” (e.g., EfficientNet family) improve the accuracy–efficiency frontier via compound scaling and squeeze-and-excitation, yet still depend on large input resolutions and long training schedules20,24. Attention models such as ViT and Swin Transformer extend receptive fields globally, often surpassing CNNs on in-the-wild benchmarks after large-scale pre-training; however, they shift the bottleneck from convolution FLOPs to tokenisation and multi-head attention, with training stability and data hunger that complicate deployment25,26. Overall, the literature shows accuracy gains from modern architectures, but there is mixed evidence on whether these gains persist under strict latency, energy, or edge-device constraints—precisely the regime many FER applications require.

Compound-expression recognition. Most state-of-the-art reports optimise for the six basic emotions; far fewer treat compound expressions (e.g., fearfully surprised, sadly disgusted) where AU overlap compresses inter-class margins27. Studies on RAF-DB and related “in-the-wild” corpora document recurring confusions along shared valence–arousal axes (fear ↔ surprise, sadness ↔ disgust), class imbalance, and annotation ambiguity—factors that inflate headline accuracy while masking failure modes28,29. Temporal cues (onset/offset dynamics), occlusion, and culture-specific display rules further erode robustness9,30. Even strong backbones (ResNet, EfficientNet, ViT) tend to overfit dominant compounds without targeted rebalancing or region-aware attention29,31. Recent attempts—landmark-guided attention, local–global fusion, and curriculum/contrastive training—show incremental gains but often at additional compute or with brittle hyperparameters28,31,32. The consensus emerging from these works is clear: improving compound-class separability requires architectures and training objectives that explicitly model fine-grained AU interactions and class geometry, rather than simply deeper networks.

Quantum-enhanced learning. Hybrid quantum–classical methods have been proposed to address precisely these margin and efficiency issues. QSVMs replace or augment classical kernels with quantum feature maps that, in principle, linearise otherwise hard decision boundaries in high-dimensional Hilbert spaces16; QCNNs introduce variational quantum circuits as convolution/pooling surrogates to compress features with fewer parameters17; and QKNN variants exploit amplitude encoding and Grover-style search to reduce neighbour retrieval complexity18,19. Empirical reports on vision and affective tasks are encouraging—often showing comparable accuracy to strong CNN/ViT baselines at lower feature-extraction cost—but remain heterogeneous in datasets, circuit depth, simulators vs. hardware, and statistical testing33. Moreover, practical limits (noise, qubit count, compilation overhead) can erase theoretical speedups if pipelines are not co-designed end-to-end. The most credible path emerging in the literature is hybridisation: use classical front-ends for stable low-level cues and deploy quantum kernels/circuits where they most affect margin geometry and search, evaluated under unified protocols with compute accounting and significance testing. This is the lens through which our comparative study is positioned.

In-memory computing and low-precision inference. In addition to classical digital accelerators and near-term quantum processors, in-memory computing (IMC) is a relevant intermediate hardware paradigm that offers higher-than-classical parallelism while avoiding many of the integration constraints of quantum hardware. A practical IMC trade-off is reduced numerical precision relative to standard digital arithmetic, motivating ablations over weight/training/inference bit-precision (e.g., 8/6/4-bit) and robustness of decision boundaries under quantization. While IMC experiments are outside our current benchmark scope, we include this perspective to contextualize compute-efficient FER deployments and to define a concrete extension of the present compute-accounted framework.

Methods

Figure 1 summarises our unified mini-pipeline for complex-emotion FER, standardising data flow from face detection/alignment to evaluation. Images are first aligned to reduce pose and illumination variance, then lightly augmented and normalised. A modular encoder stage (ResNet50, VGGFace, EfficientNetV2-S, or ViT-B/16) feeds interchangeable classifiers—either a linear softmax head, a classical margin-based SVM on frozen features, or a quantum-enhanced head (QSVM/QKNN/QCNN). We apply optional probability calibration (temperature/Platt) to improve decision reliability and report macro-averaged metrics with full confusion matrices, alongside throughput/latency to expose compute–accuracy trade-offs. This design makes model swaps and ablations plug-and-play, ensuring apples-to-apples comparisons and reproducibility across all experiments.

Figure 1.

Figure 1

Mini-pipeline (overview).

Data and preprocessing

This study investigates a subset of compound expressions from the RAF-DB dataset, comprising 11 classes and 3,954 images. To maintain consistency, we employ a fixed stratified subject split of 80% for training, 10% for validation, and 10% for testing across all models. Faces are detected and aligned using a 5-point method, then central cropped and resized to 224 × 224 pixels. Unless indicated otherwise, the images are in RGB format and normalised to the ImageNet mean and standard deviation ([0.485, 0.456, 0.406] / [0.229, 0.224, 0.225]). The validation and test datasets strictly adhere to the processing steps of Resize(256) → CenterCrop(224) → Normalise, without any augmentation, to provide a reliable estimate of generalisation.
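
For illustration, the following minimal sketch (assuming a torchvision pipeline; exact scripts may differ) reproduces the evaluation path Resize(256) → CenterCrop(224) → Normalise and the moderate CNN-family train-time geometry of Table 1; batch-level Mixup/CutMix are omitted.

```python
# Illustrative preprocessing sketch (assumption: torchvision backend).
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Validation/test path: Resize(256) -> CenterCrop(224) -> Normalise (no augmentation).
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Train-time path for the CNN family, following the ranges in Table 1.
train_transform_cnn = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.RandomRotation(15)], p=0.30),
    transforms.RandomApply([transforms.ColorJitter(brightness=(0.8, 1.2))], p=0.30),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```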

We standardise augmentation techniques to achieve a balance between robustness and architectural sensitivity. CNN families, such as EfficientNetV2-S and ResNet50-SVM, benefit from moderate geometric transformations (including flipping, rotation, and affine transformations) and light brightness jitter. Additionally, Mixup and CutMix techniques enhance margin smoothing in scenarios of class imbalance; the Sc-2 variant of EfficientNet reduces regularisation to facilitate faster iterations. The ViT-B/16 model maintains flips and light brightness jitter, but omits rotation, zoom, and shear to prevent token misalignment caused by patch embeddings. Quantum hybrid models (QSVM, QKNN, QCNN) implement minimal, face-preserving transformations to maintain the aligned geometry required by their encoders and kernels, with no label smoothing or sample mixing. For SVM heads, we cautiously allow CutMix and Mixup with ResNet50-SVM and recommend disabling CutMix if margin calibration becomes unstable. Table 1 consolidates all training-time choices, allowing readers to replicate the settings and adjust regularisation strength according to deployment constraints.

Table 1.

Train-time augmentation by family.

Family/Model Geometric ops Intensity ops Regularizers Notes
EfficientNetV2-S RandomResizedCrop(224, scale = 0.8–1.0, ratio = 3/4–4/3); HorizontalFlip p = 0.5; Rotation ± 15° p = 0.30; Affine(scale ± 0.20, shear ± 0.20) p = 0.30 Brightness [0.8, 1.2] p = 0.30; Normalise Label smoothing ε = 0.1; Mixup α = 0.2 p = 0.5; CutMix α = 0.2 p = 0.5 Time-critical override (Sc-2): Mixup p = 0.30; CutMix off
ViT-B/16 RandomCrop(224); HorizontalFlip p = 0.5 Brightness [0.8, 1.2] p = 0.30; Normalise Label smoothing ε = 0.1; Mixup α = 0.2 p = 0.5 Rotation/zoom/ shear off to avoid patch misalignment
QSVM / QKNN / QCNN RandomResizedCrop(224, scale = 0.9–1.0); HorizontalFlip p = 0.5 Brightness [0.9, 1.1] p = 0.20; Normalise (none)—no label smoothing; Mixup/CutMix off Light augments only; preserve aligned facial structure
ResNet50-SVM RandomResizedCrop(224, scale = 0.8–1.0); HorizontalFlip p = 0.5; Rotation ± 15° p = 0.30; optional Affine p = 0.20 Brightness [0.8, 1.2] p = 0.30; Normalise Label smoothing ε = 0.1; Mixup α = 0.2 p = 0.5; CutMix α = 0.2 p = 0.30 Disable CutMix if the SVM head shows instability
VGGFace-SVM RandomResizedCrop(224, scale = 0.9–1.0); HorizontalFlip p = 0.5; optional mild rotation ± 5° p = 0.20 Brightness [0.9, 1.1] p = 0.20; Normalise Label smoothing ε = 0.1 Conservative augments to respect identity-biased embeddings

Models

ResNet50–SVM (classical hybrid)

A ResNet50 (ImageNet-pretrained) serves as a fixed or lightly fine-tuned backbone. We extract global-pooled features from the penultimate stage (the ablation also considers a shallower endpoint to reduce compute). Features are ℓ2-normalised and fed to an SVM with kernels {linear, RBF, poly}; C and kernel hyperparameters are tuned on the validation set via grid search. This hybrid probes whether handing off to a margin-based classifier improves the separability of compound classes under constrained training budgets.
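
A minimal sketch of this hybrid (assuming a torchvision ResNet50 backbone and a scikit-learn SVM head; the grid values and loader names are illustrative):

```python
# Sketch of ResNet50 feature extraction + SVM head (torchvision / scikit-learn; placeholders noted).
import torch
import torch.nn as nn
from torchvision import models
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()          # expose the global-pooled 2048-D penultimate features
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:              # loader yields preprocessed 224x224 tensors
        feats.append(backbone(x))
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# l2-normalise the frozen features, then grid-search the SVM on the validation split.
svm = GridSearchCV(
    SVC(class_weight="balanced"),
    {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10]},  # illustrative grid
    cv=3, n_jobs=-1,
)
# X_train, y_train = extract_features(train_loader)            # hypothetical loader
# svm.fit(Normalizer(norm="l2").fit_transform(X_train), y_train)
```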

VGGFace–SVM (classical hybrid)

Using VGGFace (VGG-16), pretrained on large-scale face data, we extract fc7 (4096-D) embeddings, ℓ2-normalise them, and train an SVM as above. This baseline tests whether identity-tuned facial features remain discriminative for affective blends, and where they fail (e.g., disgust/sadness overlaps).

EfficientNetV2-S (modern CNN)

EfficientNetV2-S represents a contemporary CNN emphasising parameter/FLOPs efficiency via compound scaling of depth $d$, width $w$, and resolution $r$:

$$d = \alpha^{\phi},\quad w = \beta^{\phi},\quad r = \gamma^{\phi},\qquad \text{s.t. } \alpha\,\beta^{2}\gamma^{2} \approx 2,\ \alpha,\beta,\gamma \ge 1 \tag{1}$$

with $\phi$ controlling the overall resource budget. We fine-tune all layers with AdamW, cosine LR decay, and stochastic depth. This model probes the best achievable accuracy under tight computational constraints with modern CNN inductive biases.
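
For illustration, the following minimal configuration mirrors the best schedule in Table 4 (30 epochs, lr = 1e-3); the weight-decay value, head replacement, and label-smoothing placement are assumptions, and the partial-freezing (FT) knob from the ablation is omitted.

```python
# Illustrative fine-tuning configuration for EfficientNetV2-S (assumptions noted above).
import torch
from torchvision import models

model = models.efficientnet_v2_s(weights=models.EfficientNet_V2_S_Weights.IMAGENET1K_V1)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 11)  # 11 compound classes

epochs = 30
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # per Table 1 (epsilon = 0.1)
```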

Vision transformer (ViT)

Images $x \in \mathbb{R}^{H\times W\times C}$ are split into $N$ non-overlapping patches and linearly projected to tokens. With a learnable class token $x_{\text{class}}$ and positional embeddings $E_{\text{pos}}$:

$$z_{0} = \left[x_{\text{class}};\, x_{p}^{1}E;\, x_{p}^{2}E;\, \ldots;\, x_{p}^{N}E\right] + E_{\text{pos}} \tag{2}$$

the Transformer encoder applies $L$ blocks of multi-head self-attention (MSA) and MLP with residuals:

$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1},\qquad z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell},\qquad \ell = 1,\ldots,L \tag{3}$$

The final class-token state $z_{L}^{0}$ goes to a linear head. ViT tests whether global, long-range modelling improves compound separability, at the cost of higher training compute.
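
For concreteness, one pre-norm encoder block of Eq. (3) can be sketched in PyTorch as follows (768-dimensional tokens and 12 heads as in standard ViT-B/16; illustrative, not the training code used here):

```python
# One ViT encoder block (Eq. 3): pre-norm MSA and MLP, each with a residual connection.
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z' = MSA(LN(z)) + z
        return z + self.mlp(self.norm2(z))                  # z  = MLP(LN(z')) + z'
```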

Hybrid quantum CNN (QCNN)

Classical images are embedded into quantum states $|\psi(x)\rangle$ and processed by local unitary “quantum convolution” blocks followed by quantum pooling (measurement/partial trace), yielding:

$$\rho_{\text{out}} = \mathrm{Tr}_{\text{pool}}\!\left[\, U_{\text{conv}}\, |\psi(x)\rangle\langle\psi(x)|\, U_{\text{conv}}^{\dagger} \right] \tag{4}$$

The resulting reduced statistics (expectation values) are concatenated with classical features (optional) and passed to a shallow head. QCNN probes whether quantum locality + pooling can compress features while preserving discriminative structure, reducing extraction time.

Hybrid quantum K-nearest neighbour (QKNN)

Classical vectors are amplitude-encoded:

$$|\psi(x)\rangle = \frac{1}{\lVert x\rVert}\sum_{i} x_{i}\,|i\rangle \tag{5}$$

and similarity is estimated via a swap test, giving an inner-product kernel:

$$K(x, x') = \left|\langle \psi(x) \mid \psi(x')\rangle\right|^{2} \tag{6}$$

Neighbour search is accelerated via quantum subroutines (e.g., Grover-style amplitude amplification), reducing the effective search to $O(\sqrt{kN})$ for $k$ neighbours in $N$ items. QKNN probes whether combining quantum similarity estimation with sublinear search yields better latency–accuracy trade-offs than classical KNN/SVM methods.
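
A minimal PennyLane sketch of the swap test behind Eqs. (5)–(6), assuming two-dimensional inputs angle-encoded on two qubits per register (the QKNN encoder and wire counts used in our experiments follow Table 2 and may differ):

```python
# Swap-test similarity sketch (assumption: 2-D inputs, AngleEmbedding, 2 qubits per register).
import numpy as np
import pennylane as qml

n = 2                                             # qubits per register
dev = qml.device("default.qubit", wires=2 * n + 1)

@qml.qnode(dev)
def swap_test(x1, x2):
    qml.AngleEmbedding(x1, wires=range(1, n + 1))            # encode sample 1
    qml.AngleEmbedding(x2, wires=range(n + 1, 2 * n + 1))    # encode sample 2
    qml.Hadamard(wires=0)                                    # ancilla
    for i in range(n):
        qml.CSWAP(wires=[0, 1 + i, 1 + n + i])               # controlled swaps
    qml.Hadamard(wires=0)
    return qml.expval(qml.PauliZ(0))      # <Z> on the ancilla = |<psi(x1)|psi(x2)>|^2

similarity = swap_test(np.array([0.3, 1.1]), np.array([0.4, 0.9]))  # kernel value for k-NN
```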

Hybrid quantum SVM (QSVM)

QSVM uses a quantum feature map $\phi(x)$ to embed data into a high-dimensional Hilbert space; the kernel is evaluated as:

$$K(x_{i}, x_{j}) = \left|\langle \phi(x_{i}) \mid \phi(x_{j})\rangle\right|^{2} \tag{7}$$

A classical SVM then solves for the dual $\alpha$-weights with this kernel; the decision function is:

$$f(x) = \mathrm{sign}\!\left(\sum_{i} \alpha_{i}\, y_{i}\, K(x_{i}, x) + b\right) \tag{8}$$

QSVM tests whether quantum kernels sharpen margins for overlapping compound classes at lower feature-extraction cost than full deep nets.
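
A minimal sketch of this route, computing kernel entries as state overlaps (Eq. 7) and passing the precomputed Gram matrix to a classical SVM; the angle-embedding feature map shown is illustrative (Table 2 specifies an amplitude-embedding map with reps = 1), and X_train/X_test are placeholders for the HOG-derived features:

```python
# Quantum-kernel SVM sketch (PennyLane simulator; illustrative angle-embedding feature map).
import numpy as np
import pennylane as qml
from sklearn.svm import SVC

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def overlap_circuit(x1, x2):
    qml.AngleEmbedding(x1, wires=range(n_qubits))
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    return overlap_circuit(x1, x2)[0]         # P(|0...0>) = |<phi(x2)|phi(x1)>|^2

def gram(A, B):
    return np.array([[quantum_kernel(a, b) for b in B] for a in A])

# X_train, X_test: n_qubits-dimensional feature vectors (placeholders); y_train: labels.
# clf = SVC(kernel="precomputed", C=10).fit(gram(X_train, X_train), y_train)
# preds = clf.predict(gram(X_test, X_train))
```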

Implementation notes (shared). All deep models use mixed-precision training when available; backbones are initialised from standard pretraining (ImageNet/face). Hyperparameters (LR, batch size, epochs) are tuned within a modest budget shared across models to preserve fairness.

Quantum feature maps and compound margins

Compound classes differ primarily by subtle AU co-activations (e.g., wide-eye + brow tension vs. similar patterns with small mouth changes), yielding non-linearly separable manifolds in pixel/feature space. A quantum feature map $\phi(x)$ embeds an image-derived feature vector $x$ into a high-dimensional Hilbert space, with kernel

$$K(x, x') = \left|\langle \phi(x) \mid \phi(x')\rangle\right|^{2}, \tag{9}$$

which can realise data-dependent, highly non-polynomial similarities. Intuitively, phase-coupled encodings and entangling layers act like multiplicative feature interactions, amplifying small AU differences (e.g., orbicularis oculi vs. frontalis) while attenuating shared baselines, thereby widening the margins between confusable pairs (e.g., fear–surprise; sadness–disgust). In QSVM, the decision function

$$f(x) = \mathrm{sign}\!\left(\sum_{i} \alpha_{i}\, y_{i}\, K(x_{i}, x) + b\right) \tag{10}$$

inherits this geometry; for QKNN and QCNN, swap-test similarities and entangling convolutions serve analogous roles. This explains why QSVM outperforms classical hybrids at moderate FX, and why fear–surprise remains difficult (shared high-arousal eye cues require encodings emphasising upper-lid/brow dynamics).

Quantum–hybrid architectures and hyperparameters

Figure 2 consolidates the three quantum hybrids used in this study. (A) QSVM implements a feature-map circuit with data reuploading and pairwise entanglers; the classifier never measures class logits directly but computes a quantum kernel via state overlaps (a Gram matrix), which is then passed to a classical SVM solver. (B) QKNN adopts the swap-test similarity: an ancilla-controlled swap estimates $|\langle \psi(x_i) \mid \psi(x_j)\rangle|^{2}$ between encoded samples, enabling k-nearest neighbour search in a quantum-encoded space. (C) QCNN uses a lightweight encoder—AmplitudeEmbedding → RX(π/3) on each wire → linear CNOT chain—and returns ⟨Z⟩ readouts as a compact feature vector for a classical head. Panels (A)–(B) follow canonical templates; panel (C) mirrors our training code exactly, ensuring methodological fidelity.

Figure 2.

Figure 2

Quantum-hybrid circuit layouts. (a) QSVM feature-map circuit: data reuploading with pairwise entanglers; classification via quantum kernel (state overlaps). (b) QKNN swap-test similarity: ancilla-controlled swaps to estimate $|\langle \psi(x_i) \mid \psi(x_j)\rangle|^{2}$ for k-NN retrieval. (c) QCNN encoder used in this study: AmplitudeEmbedding (AE) → RX(π/3) per qubit → linear CNOT chain; ⟨Z⟩ readouts are concatenated and fed to a classical head. Panels (a,b) follow canonical designs; panel (c) reproduces our training code exactly.

We selected QSVM to probe whether quantum feature maps enlarge margins for compound classes (fear–surprise; sadness–disgust/anger) with moderate head cost. QKNN targets latency: swap-test similarities allow a simple retrieval-based decision rule once features are encoded, matching edge scenarios where feature extraction dominates. QCNN prioritises throughput: a single local rotation layer, combined with a linear entanglement pattern, minimises circuit depth and error while preserving mid-scale interactions captured by the CNOT chain. In all cases, encoding is the main capacity knob; we keep depths shallow (reps≈1) to stay within near-term noise and our compute budget.

Table 2 lists the exact quantum settings used across hybrids: the embedding type and qubit count, feature-map/encoder depth, entanglement topology, and readout/pooling strategy. For QCNN, we employ AmplitudeEmbedding on 8 qubits, one local RX(π/3) layer, a linear 0 → 1 → … → 7 CNOT chain, and ⟨Z⟩ readouts on all wires (concatenated; no pooling). QKNN uses AngleEmbedding on data wires (as in the notebook), a single reupload layer, linear entanglement, and ⟨Z⟩ features concatenated before classical kNN. QSVM uses an AmplitudeEmbedding feature map with reps = 1 (equivalently, a single ZZ/Pauli-type layer), linear ZZ entanglement, and kernel overlaps as outputs (no pooling). These choices keep the feature-extraction wall-clock small and make comparisons with CNN/ViT heads fair under our unified protocol.

Table 2.

Quantum-hybrid hyperparameters (encoding, depth, entanglement, readout/pooling).

Model Encoding (type; #qubits) Feature-map / encoder depth Entanglement pattern Readout & pooling
QCNN (Hybrid) AmplitudeEmbedding (normalised input), 8 qubits 1 local layer: RX(π/3) per qubit Linear CNOT chain (0 → 1 → … → 7) ⟨Z⟩ on all 8 wires; concatenate (no pooling)
QKNN (Hybrid) AngleEmbedding (per-feature rotations) on data wires ≈1 layer (data reupload) Linear ⟨Z⟩ features per wire; concatenate → classical kNN
QSVM (Hybrid) AmplitudeEmbedding feature map reps = 1 (single ZZ/Pauli-type layer) Linear ZZ Kernel overlaps (Gram matrix); no pooling
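
Written out, the QCNN row of Table 2 corresponds to the following PennyLane sketch; it is reconstructed from the table's description (AmplitudeEmbedding on 8 qubits, one RX(π/3) layer, a linear CNOT chain, ⟨Z⟩ readouts) and may differ in minor details from the training code.

```python
# QCNN encoder sketch reconstructed from Table 2 (8 qubits, RX(pi/3), linear CNOT chain, <Z> readouts).
import numpy as np
import pennylane as qml

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qcnn_encoder(x):
    # x: classical feature vector of length 2**8 = 256 (padded/normalised as needed).
    qml.AmplitudeEmbedding(x, wires=range(n_qubits), normalize=True, pad_with=0.0)
    for w in range(n_qubits):
        qml.RX(np.pi / 3, wires=w)                 # one local rotation layer
    for w in range(n_qubits - 1):
        qml.CNOT(wires=[w, w + 1])                 # linear chain 0 -> 1 -> ... -> 7
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

features = qcnn_encoder(np.random.rand(256))       # eight <Z> values for the classical head
```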

Metrics, statistics, and compute accounting

We report top-1 accuracy, macro-F1 (treating classes equally), and weighted-F1 (support-weighted). We include per-class precision/recall and confusion matrices to interrogate error structure among compound pairs (e.g., fear–surprise, sadness–disgust).

We assess paired differences and uncertainty under a fixed test split. Paired McNemar tests (two-sided, exact binomial) are run on per-sample correctness for each model pair; we control family-wise error using Holm-Bonferroni over all pairs reported. For point estimates, we compute BCa 95% bootstrap CIs (resampling test images with replacement) for Top-1 and Macro-F1. Between-model effect size on correctness uses Cliff’s delta (0 = tie; ± 1 = stochastic dominance). We require per-sample predictions in either long format (image_id, true_label, pred_label, model_name) or paired format (image_id, true_label, m1_name, m1_pred, [m1_conf], m2_name, m2_pred, [m2_conf]), which we auto-convert. When some runs use different label spaces (e.g., 9-class vs 11-class), we report each model’s native class count explicitly and interpret results within that stated setting; harmonised label-space evaluation is reserved for future work.
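
A minimal sketch of these per-sample statistics, assuming 0/1 correctness vectors per model and off-the-shelf implementations (statsmodels for the exact McNemar test, SciPy's bootstrap for BCa intervals, a direct Cliff's δ computation; illustrative, not our exact tooling):

```python
# Per-sample statistics sketch (statsmodels / SciPy; 0/1 correctness arrays).
import numpy as np
from scipy.stats import bootstrap
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value on paired per-sample correctness."""
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    n_ab = int(np.sum((a == 1) & (b == 0)))        # A right, B wrong
    n_ba = int(np.sum((a == 0) & (b == 1)))        # A wrong, B right
    return mcnemar([[0, n_ab], [n_ba, 0]], exact=True).pvalue

def cliffs_delta(correct_a, correct_b):
    """Cliff's delta on correctness: 0 = tie, +/-1 = stochastic dominance."""
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    diff = a[:, None] - b[None, :]
    return float((np.sum(diff > 0) - np.sum(diff < 0)) / diff.size)

def bca_ci(correct, confidence=0.95):
    """BCa bootstrap interval for accuracy over the test set."""
    res = bootstrap((np.asarray(correct),), np.mean, confidence_level=confidence,
                    method="BCa", n_resamples=10_000)
    return res.confidence_interval.low, res.confidence_interval.high
```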

To align with the paper’s focus on efficiency, we report (i) feature-extraction wall-clock time, (ii) training wall-clock (fine-tuning or head training), and (iii) per-sample inference/ classification time. All models run on the same hardware profile; batch size and precision are documented. Efficiency is analysed as accuracy (or macro-F1) per second of extraction/inference to expose true cost–benefit trade-offs.

Simulator-to-hardware mapping for quantum hybrids

Because quantum pipelines are evaluated with simulators while classical models run on local hardware, simulator wall-clock time may under-represent real quantum execution costs (state preparation, circuit compilation, shot counts, queueing latency, and noise/mitigation overheads). To contextualize this gap, we provide a hardware-normalized cost instantiation as a representative example for a kernel-based hybrid: HQKNN. The proxy combines (i) the number of circuit evaluations required to build kernel blocks and (ii) transpiled circuit resources (qubits and 1Q/2Q gate counts) expressed in a provider-compatible basis. Under our split protocol, kernel construction scales approximately as N_train(N_train + 1)/2 + (N_val + N_test)N_train evaluations (exploiting symmetry), which becomes O(N_train^2) at dataset scale. For the HQKNN configuration used here (8-qubit ZZ feature map, reps = 2, linear entanglement), the workload implied by our two-stage split is 1,285,209 circuit evaluations (603,351 train-symmetric + 304,146 validation + 377,712 test). With a shot budget of 1024 per evaluation, this corresponds to 1,316,054,016 shots (about 1.32 × 10^9) before any repeats for error mitigation. After transpilation to a universal basis (rz, sx, x, cx), the circuit comprises 78 single-qubit gates and 28 two-qubit gates per evaluation (depth 35), with the detailed breakdown rz = 62, sx = 16, cx = 28. These quantities can be entered into provider estimators (e.g., IonQ Resource Estimator) to obtain hardware-normalized execution costs. At this workload scale, full-dataset kernel evaluation on current QPUs is economically and operationally impractical; accordingly, the benchmark uses simulators for reproducibility and feasibility, while hardware runs are most appropriately framed as feasibility demonstrations via prototype/Nystrom kernel approximations, reduced-shot studies, and small-scale subsets. Appendix A.3 reports the provider-ready inputs for this HQKNN instantiation. Analogous cost instantiations for other quantum hybrids are not included in the present benchmark and are a natural extension of this cost-accounting protocol.
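
The workload arithmetic can be checked with a short script; the split sizes (1098/277/344) are inferred here so that the block counts quoted above are reproduced exactly, and the helper name is illustrative.

```python
# Kernel-workload accounting sketch; split sizes inferred to match the quoted block counts.
def kernel_workload(n_train, n_val, n_test, shots=1024):
    train_block = n_train * (n_train + 1) // 2        # symmetric train-train block
    eval_block = (n_val + n_test) * n_train           # val/test rows against train columns
    circuits = train_block + eval_block
    return train_block, eval_block, circuits, circuits * shots

train_block, eval_block, circuits, total_shots = kernel_workload(1098, 277, 344)
print(train_block, circuits, total_shots)   # 603351, 1285209, 1316054016
```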

Results

Under the unified RAF-DB protocol, which involves identical splits, preprocessing, and evaluation, Table 3 presents the optimal settings for each model along with their computational costs. Meanwhile, Fig. 3 illustrates the relationship between accuracy and feature-extraction (FX) time, highlighting the performance frontier. ViT-B/16 stands out in the accuracy corner with 63.13% accuracy and an FX time of approximately 32.84 s. EfficientNetV2-S serves as the strongest CNN baseline, achieving 60.9% accuracy but with a significantly higher extraction cost of around 2056.92 s. Among the quantum hybrid models, QSVM offers the best accuracy-to-compute ratio at 54.97% with an FX time of about 61.6 s, whereas QKNN keeps extraction time low (approximately 24.47 s) at 36.02% accuracy, making it suitable for strict latency requirements. QCNN, while being the most FX-efficient with an extraction time of roughly 11.91 s, has limited accuracy at 35.69%. Classical hybrid models provide useful benchmarks: ResNet50-SVM (Conv4_block6) achieves 43.09% accuracy at 644 s (about a 9 percentage-point gain in average accuracy and roughly 9% lower FX than Conv5_block3), and VGGFace-SVM records 41% accuracy at 1264 s. Collectively, the data in the table and the corresponding scatter plot reveal two distinct operational paradigms: accuracy-focused (ViT, EfficientNetV2-S) and compute-efficient (QSVM, QKNN), with classical hybrids situated in between.

Table 3.

Summary of best results across models.

Model Best Accuracy (%) FX time (s) Training time (s) Cls time (s) Notes
ResNet50-SVM (Conv4_block6) 43.09 644 15 < 1.00 s Best trade-off vs full ResNet50 (↑9% acc, ↓9.2% FX vs Conv5_block3); grid in the following subsection
VGGFace-SVM 41 1264 150 < 1.00 s Full confusion matrix & per-class metrics in the following subsection; overall accuracy reported there
EfficientNetV2-S 60.9 2056.92 41.02 2.65 Scenario 5 (30 ep, bs = 32, lr = 1e-3)
ViT (B/16) 63.13 32.84 126.27 5.12 50 ep, bs = 32, lr = 1e-5; times vary across configs; details in the following subsection
QCNN (Hybrid) 35.69 11.91 16.7 < 1.00 s Peak at 100 ep, lr = 1e-4, bs = 64; full grid in the following subsection
QKNN (Hybrid) 36.02 24.47 < 1.00 s Best at k = 9, Euclidean, uniform weight; stability across k shown in the following subsection
QSVM (Hybrid) 54.97 61.58 (quantum) / 56.28 (HOG) 32.05 6.16 RBF, C = 10; polynomial/sigmoid trails; kernel study in the following subsection

*FX feature extraction, Cls classification.

Figure 3.

Figure 3

Pareto scatter (best setting per model): accuracy vs feature-extraction time.

Confusion analyses consistently show two families of failure: fear–surprise and sadness–disgust. Figure 4 decomposes class-wise mistakes into within-family, cross-family, and other errors for two recurrent confusion families—fear–surprise and sadness–disgust—across three representative models (EfficientNetV2-S, ViT-B/16, QSVM). All models exhibit substantial within-family leakage on the fear–surprise axis (e.g., Fearfully Surprised ↔ Happily/Angrily Surprised), while cross-family spillover remains more pronounced on sadness–disgust mixtures (e.g., Sadly Disgusted ↔ Sadly Angry). ViT’s global attention lowers some cross-valence confusions, yet it still overlooks localised action unit (AU) contrasts; QSVM tightens margins via quantum kernels but remains sensitive to shared wide-eye cues. The stacked profiles reinforce that feature spaces are not orthogonalized for overlapping AUs, motivating AU-aware local attention, margin-shaping losses, and class-balanced augmentation. Lightweight evidence for kernel shaping (fusion + regularisation) and a strictly matched classical baseline are reported in Appendix A.2 (Tables A2,A3).

Figure 4.

Figure 4

Stacked Class-Wise Error Decomposition By Family (EfficientNetV2 S, ViT B/16, QSVM).

Figure 5a–g presents the best confusion matrices (CMs) for each model family in a cohesive multi-panel format, featuring a uniform colour scale. Each figure displays percentages per cell, and the subpanel titles highlight the class dimensions (9/11) along with a summary of Top-1/Macro-F1, ensuring that comparisons across models are clear and easy to interpret.

Figure 5.

Figure 5

Confusion matrices (best per family). (a) EfficientNetV2-S (11 class, Acc 60.9%, Macro-F1 0.46); (b) ViT-B/16 (11 class, Acc 63.13%, Macro-F1 0.47); (c) QSVM (9 class, Acc 54.97%, Macro-F1 0.42); (d) QKNN (9 class, Acc 36.02%, Macro-F1 0.23); (e) ResNet50-SVM (11 class, Acc 43.09%, Macro-F1 0.41); (f) VGGFace-SVM (11 class, Acc 41%, Macro-F1 0.39); (g) QCNN (9 class, Acc 35.69%, Macro-F1 0.34). For visual comparability, all panels use an identical class ordering.

Two primary families of errors emerge: fear–surprise (e.g., Fearfully Surprised ↔ Happily/Angrily Surprised) and sadness–disgust/anger (e.g., Sadly Disgusted ↔ Sadly Angry). The ViT-B/16 model alleviates some cross-valence leakage in high-arousal classes (predominantly surprise) due to its utilisation of global context; however, it still exhibits fragility in low-arousal mixtures. The QSVM model enhances differentiation in anger/disgust mixtures, as evidenced by thicker diagonals and thinner off-diagonals in the corresponding subpanel. Meanwhile, EfficientNetV2-S consistently performs well on classes such as Sadly Disgusted and Happily Surprised—effectively capturing lip texture, curvature, and mid-scale cues—but continues to show leakage with fear-related pairs. The classical model family (ResNet50-SVM, VGGFace-SVM) exhibits a noisier pattern when dealing with overlapping mixtures, whereas QKNN/QCNN suppresses feature extraction (FX) while maintaining similar topological errors.

These observations suggest that the feature space has not been sufficiently orthogonalized for overlapping action units (AUs). To mitigate leakage in these two error families without compromising efficiency, we recommend implementing AU-aware local attention (focusing on regions such as the orbicularis oculi, frontalis, and nasolabial), margin shaping techniques (such as distance/contrastive loss or kernel adjustments to enhance the boundaries between adjacent classes), and class-balanced augmentation (including re-weighting, sampling, and measured mixup).

EfficientNetV2-S ablation (25 scenarios)

Across 25 configurations varying epochs, batch size, learning rate, and fine-tuning extent (FT), Scenario 5 (30/32/1e-3; FT = 0) achieves the highest test accuracy (0.609) with Val = 0.591, at FX = 2056.92 s, Train = 41.02 s, Cls = 2.65 s (Top-5 summary in Table 4). Scenario 7 (30/32/1e-4; FT = 0) posts the best validation (0.611) but drops on test (0.593), indicating mild over-tuning at the lower learning rate. Several bs = 64 runs reach Val≈0.598–0.604 yet underperform on test (≤ 0.578), suggesting larger batches stabilise validation but don’t consistently translate to generalisation. The time-critical path favours smaller batches: Scenario 1 (30/16/1e-3; FT = 0), also in the Top-5, and the partially unfrozen variant Scenario 2 (30/16/1e-3; FT = 50) shorten training wall-clock substantially (e.g., ~ 23 s in Sc-1; roughly 2 × faster than Sc-5) for a modest accuracy trade-off. Extending training (Scenario 17: 50/32/1e-3; FT = 0) does not improve test accuracy (0.601) and inflates FX (3101.48 s), showing diminishing returns from longer schedules. Overall, the ablation reveals a stable operating band around bs = 32–64 and lr ∈ {1e-4, 1e-3}; (30/32/1e-3) offers the most reliable ceiling, while (30/16/1e-3) provides the best turn-around for rapid iterations. The complete grid of 25 scenarios along with their wall-clock times is detailed in the Appendix (Table A1).

Table 4.

EfficientNetV2-S ablation—Top-5 configurations by test accuracy with wall-clock costs (FX/Train/Cls) and Macro-F1.

Rank Scenario (Sc) Setting (epochs, batch, lr, FT) Val Test Macro-F1 FX Time (s) Training Time (s) Cls Time (s)
1 5 30 ep, bs = 32, lr = 1e-3, FT = 0 0.591 0.609 0.4569 2056.92 41.02 2.65
2 1 30 ep, bs = 16, lr = 1e-3, FT = 0 0.581 0.606 0.4564 1941.09 23.27 1.36
3 17 50 ep, bs = 32, lr = 1e-3, FT = 0 0.586 0.601 0.4038 3101.48 21.39 1.23
4 7 30 ep, bs = 32, lr = 1e-4, FT = 0 0.611 0.593 0.4369 1982.75 21.44 1.49
5 11 30 ep, bs = 64, lr = 1e-4, FT = 0 0.583 0.593 0.4428 1956.78 23.55 1.52

ResNet50 ablation: feature-tap depth vs. accuracy and extraction cost

We ablate the feature-tap depth of ResNet50 by progressively moving the last convolutional block from Conv5_block3 to Conv2_block1, while holding the SVM head and evaluation protocol fixed. As shown in Table 5, tapping at Conv4_block6 yields the best overall trade-off: a + 9.0 percentage-point gain in accuracy relative to Conv5_block3 alongside a − 9.2% reduction in feature-extraction (FX) time. Shallower taps within Conv4 (blocks 5 → 1) maintain comparable FX reductions (≈617 → 528 s) but do not surpass the accuracy of Conv4_block6, suggesting that Conv4_block6 preserves critical high-level semantics while removing some of the redundancy and cost of the Conv5 stage.

Table 5.

ResNet50 block-reduction ablation—accuracy vs. feature-extraction time across tap points (Conv5 → Conv2) with fixed SVM head.

Scenario (Sc) Last Block FX Time (s) Acc (%) Avg. Acc (%) Cls Time (s)
1 Conv5_block3 709 46.84 34.13 < 1.00 s
2 Conv5_block2 687 48.11 35.18 < 1.00 s
3 Conv5_block1 654 50.83 36.69 < 1.00 s
4 Conv4_block6 644 55.81 43.09 < 1.00 s
5 Conv4_block5 617 55.68 42.16 < 1.00 s
6 Conv4_block4 592 56.82 42.98 < 1.00 s
7 Conv4_block3 568 55.93 42.03 < 1.00 s
8 Conv4_block2 543 54.8 39.55 < 1.00 s
9 Conv4_block1 528 52.27 37.24 < 1.00 s
10 Conv3_block4 501 51.14 34.25 < 1.00 s
11 Conv3_block3 489 51.52 32.45 < 1.00 s
12 Conv3_block2 466 47.47 27.76 < 1.00 s
13 Conv3_block1 437 44.07 26.01 < 1.00 s
14 Conv2_block3 411 34.22 17.41 < 1.00 s
15 Conv2_block2 390 30.93 15.51 < 1.00 s
16 Conv2_block1 368 27.9 13.83 < 1.00 s

Below Conv4, performance degrades rapidly despite continued FX savings (e.g., Conv3 and Conv2 taps reduce FX from ~ 501 → 368 s but drop average accuracy by ~ 9–29 pp). This pattern aligns with the expectation that Conv5 adds class-discriminative detail, but its marginal utility—given our SVM head and compound-emotion setting—can be recovered more efficiently by Conv4_block6. In short, Conv4_block6 is the sweet spot: deep enough to retain compound-relevant cues (e.g., nasolabial changes, eye-brow interactions) yet shallow enough to shrink FX. For edge or low-latency deployments, we therefore recommend Conv4_block6 as the default tap; moving shallower should be justified only when every additional second of FX matters and the accuracy loss is acceptable for the application.

VGGFace-SVM ablation: kernel choice, class weighting, and compute

We ablate the SVM head atop VGGFace FC7 embeddings (2048–4096-D) by crossing kernel ∈ {Linear, RBF, Sigmoid, Polynomial} with class weighting ∈ {No, Yes}, holding the embedding and protocol fixed. As shown in Table 6, class weighting systematically improves macro-averaged accuracy (Avg. Acc)—the metric most sensitive to class imbalance—across three of four kernels. The best macro score (Avg. Acc = 41%) is achieved by Sigmoid + class weight (Sc-6), which also happens to be the lowest-FX configuration (1264 s), with modest training time (150 s) and Cls < 1 s. RBF + class weight (Sc-5) follows closely (Avg. Acc = 40%) but at higher FX (1457 s) and longer training (219 s). Linear kernels (Sc-4/Sc-8) are competitive (Avg. Acc = 39%) and offer the shortest training in the weighted case (136 s), but do not surpass Sigmoid on class balance. Polynomial trails on Avg. Acc (29–32%), indicating an unfavourable bias–variance trade-off for these embeddings.

Table 6.

VGGFace-SVM kernel × class-weight ablation—macro-averaged accuracy vs. compute (FX/Train) with FC7 embeddings.

Scenario (Sc) Apply class weight Kernel FX time (s) Training time (s) Acc (%) Avg. Acc (%)
1 N RBF 1407 153 56 33
2 N Sigmoid 1436 153 56 33
3 N Polynomial 1445 274 48 29
4 N Linear 1311 169 53 39
5 Y RBF 1457 219 54 40
6 Y Sigmoid 1264 150 45 41
7 Y Polynomial 1284 195 45 32
8 Y Linear 1485 136 53 39

Notably, configurations without class weighting can show higher micro Top-1 (“Acc (%)”) on frequent classes (e.g., RBF/Sigmoid at 56% Acc in Sc-1/Sc-2) while degrading macro balance (Avg. Acc = 33%). This gap highlights skew sensitivity: without rebalancing, the SVM favours head classes over rare compounds. In deployments where fairness across classes matters, we recommend Sigmoid + class weight (Sc-6) as the default; if slightly lower FX or shorter training is paramount and macro parity is still acceptable, Linear + class weight (Sc-8) is a viable alternative.

ViT-B/16 ablation: epochs × learning rate × batch size

We ablate ViT-B/16 over 12 configurations crossing epochs ∈ {30, 50}, learning rate ∈ {1e-4, 1e-5}, and batch size ∈ {16, 32, 64}, keeping preprocessing and evaluation fixed. As reported in Table 7, the best test accuracy is 63.13% at 50 epochs / lr = 1e-5 / bs = 32 (Sc-11), with FX ≈ 32.84 s, Train ≈ 126.27 s, and Cls ≈ 5.12 s. Two consistent trends emerge. First, lowering the learning rate to 1e-5 improves generalization at both epoch budgets: at 30 epochs, accuracy rises from 58.08–60.61% (lr = 1e-4, Sc-1–3) to 60.61–62.12% (lr = 1e-5, Sc-4–6); at 50 epochs, from 56.57–62.63% (lr = 1e-4, Sc-7–9) to 61.87–63.13% (lr = 1e-5, Sc-10–12). Second, batch size = 32 is a reliable sweet spot across both epoch budgets, outperforming bs = 16 and typically matching or exceeding bs = 64. The FX time is low and stable (~ 30–34 s) across settings, indicating that ViT’s extraction cost is modest relative to training. However, bs = 64 variants exhibit higher classification latency (~ 8.5–9.3 s) than bs = 16–32 (~ 5.0–5.2 s), suggesting per-sample inference overhead at larger batch sizes in our setup.

Table 7.

ViT-B/16 ablation—epochs × learning rate × batch size: test accuracy with wall-clock costs (FX/Train/Cls).

Scenario (Sc) Epochs Learning rate Batch size Acc (%) FX time (s) Training time (s) Cls time (s)
1 30 0.0001 16 58.08 32.07 123.2 5.05
2 30 0.0001 32 58.08 33.99 128.97 5.16
3 30 0.0001 64 60.61 31.82 123.88 9.16
4 30 0.00001 16 60.61 33.85 127.91 5.08
5 30 0.00001 32 62.12 30.17 116.57 5.17
6 30 0.00001 64 60.86 31.29 124.54 8.51
7 50 0.0001 16 56.57 32.08 122.72 5.05
8 50 0.0001 32 58.33 29.88 121.8 5.17
9 50 0.0001 64 62.63 31.72 116.35 8.58
10 50 0.00001 16 61.87 30.35 119.79 5.11
11 50 0.00001 32 63.13 32.84 126.27 5.12
12 50 0.00001 64 62.88 33.56 121.55 9.27

QSVM ablation: kernel family × regularisation (C) under dual feature-extraction streams

We evaluate QSVM across four kernel families (RBF, Linear, Polynomial, Sigmoid) and three regularisation levels (C ∈ {0.1, 1, 10}) while keeping the hybrid pipeline fixed (classical HOG features + quantum feature map). As summarised in Table 8, the RBF kernel with C = 10 attains the best test accuracy (0.5497), consistently outperforming Linear and Sigmoid across the same C settings. Polynomial trails in aggregate but closes part of the gap at C = 10 (0.5148), which aligns with qualitative gains we observe on anger-dominant blends (increased local curvature can help carve margins in those submanifolds).

Table 8.

QSVM kernel ablation—accuracy versus wall-clock (dual feature-extraction components, HOG and Quantum) across regularisation levels (C).

Scenario (Sc) Kernel C Acc FX time HOG (s) FX time quantum (s) Training time (s) Cls time (s)
1 RBF 0.1 0.4476 56.28 61.58 31.76 5.88
2 RBF 1 0.5161 56.28 61.58 28.38 6.26
3 RBF 10 0.5497 56.28 61.58 32.05 6.16
4 Linear 0.1 0.4637 56.28 61.58 16.33 3.73
5 Linear 1 0.461 56.28 61.58 15.67 3.1
6 Linear 10 0.461 56.28 61.58 15.71 3.04
7 Polynomial 0.1 0.3333 56.28 61.58 31.86 3.64
8 Polynomial 1 0.4677 56.28 61.58 32.54 3.98
9 Polynomial 10 0.5148 56.28 61.58 32.8 3.74
10 Sigmoid 0.1 0.4798 56.28 61.58 28.08 3.89
11 Sigmoid 1 0.4973 56.28 61.58 18.25 3.43
12 Sigmoid 10 0.4328 56.28 61.58 20.49 3.38

Compute-wise, feature extraction is split into two fixed components—FX HOG = 56.28 s and FX Quantum = 61.58 s—that do not change with the kernel; this highlights a practical point: kernel selection primarily trades accuracy vs. head-time (Train/Cls) rather than extraction cost. Linear delivers the fastest classification (≈3.1–3.7 s) and shortest training (≈15.7–16.3 s), but at a lower accuracy ceiling (~ 0.461–0.464). Sigmoid sits mid-pack (best 0.4973 at C = 1) with moderate training/cls times, while RBF(C = 10) reaches the best accuracy at modest head costs (Train ≈ 32.05 s; Cls ≈ 6.16 s).

QKNN ablation: neighbourhood size (k), distance metric, and weighting under a fixed extraction budget

We ablate QKNN over k ∈ {3,5,7,9}, metric ∈ {Euclidean, Manhattan, Chebyshev}, and weighting ∈ {Uniform, Distance}, holding the feature-extraction pipeline constant. As summarised in Table 9, accuracy increases with k up to k = 9, with Euclidean consistently outperforming Manhattan, and Chebyshev trailing by a large margin. The best configuration is k = 9 / Euclidean / Uniform at 36.02%, closely followed by k = 9 / Euclidean / Distance (35.48%) and k = 9 / Manhattan / Uniform (35.22%). These results suggest that the compound-expression manifold is better captured by smooth ℓ2 (and to a lesser extent ℓ1) geometry, whereas ℓ∞ (Chebyshev)—which emphasises only the maximum coordinate difference—systematically underestimates class proximity for overlapping blends (e.g., fear–surprise, sadness–disgust).

Table 9.

QKNN hyperparameter ablation.

Scenario (Sc) K Value Weight Metric Acc (%) FX Time (s)
1 3 Uniform Euclidean 30.51 24.47
2 3 Uniform Manhattan 28.76 24.47
3 3 Uniform Chebyshev 17.34 24.47
4 3 Distance Euclidean 29.7 24.47
5 3 Distance Manhattan 29.03 24.47
6 3 Distance Chebyshev 15.86 24.47
7 5 Uniform Euclidean 32.39 24.47
8 5 Uniform Manhattan 31.59 24.47
9 5 Uniform Chebyshev 19.49 24.47
10 5 Distance Euclidean 32.53 24.47
11 5 Distance Manhattan 30.24 24.47
12 5 Distance Chebyshev 17.34 24.47
13 7 Uniform Euclidean 33.87 24.47
14 7 Uniform Manhattan 32.66 24.47
15 7 Uniform Chebyshev 20.3 24.47
16 7 Distance Euclidean 33.87 24.47
17 7 Distance Manhattan 32.39 24.47
18 7 Distance Chebyshev 20.56 24.47
19 9 Uniform Euclidean 36.02 24.47
20 9 Uniform Manhattan 35.22 24.47
21 9 Uniform Chebyshev 23.25 24.47
22 9 Distance Euclidean 35.48 24.47
23 9 Distance Manhattan 33.47 24.47
24 9 Distance Chebyshev 22.04 24.47

Weighting effects are second-order relative to metric choice: Uniform vs Distance produces small, configuration-dependent shifts (often ≤ 1 pp) with no consistent advantage across ks. In contrast, metric selection and increasing k show clear, monotone gains up to k = 9, after which we expect diminishing returns and potential over-smoothing. Crucially, feature-extraction time (FX) remains constant at ~ 24.47 s for all settings, making QKNN an attractive latency-bound option: operators can trade small amounts of accuracy for simpler distance/weighting schemes without affecting extraction latency.
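
For reference, the k / metric / weighting grid of Table 9 maps directly onto a classical scikit-learn k-NN head once the quantum-derived features are precomputed; X_train/X_test below are placeholders for those features.

```python
# Classical k-NN head over the Table 9 grid (assumption: features are precomputed elsewhere).
from itertools import product
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def run_grid(X_train, y_train, X_test, y_test):
    results = []
    for k, metric, weights in product([3, 5, 7, 9],
                                      ["euclidean", "manhattan", "chebyshev"],
                                      ["uniform", "distance"]):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric, weights=weights)
        knn.fit(X_train, y_train)
        results.append((k, metric, weights, accuracy_score(y_test, knn.predict(X_test))))
    return sorted(results, key=lambda r: r[-1], reverse=True)   # best configuration first
```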

QCNN ablation: schedule (epochs × learning rate × batch size) under a nearly fixed extraction cost

We ablate QCNN over epochs ∈ {70, 100}, learning rate ∈ {1e-3, 1e-4}, and batch size ∈ {16, 64}, keeping the hybrid encoder and evaluation protocol fixed. As summarised in Table 10, the best configuration is 100 epochs / 1e-4 / batch 64 (Sc-8), reaching Acc = 0.3569 with FX ≈ 11.91 s, Train ≈ 16.7 s, and Cls ≈ 0.30 s. Two consistent patterns emerge. First, lowering the learning rate to 1e-4 improves generalisation at both epoch budgets (compare Sc-3/4 vs Sc-1/2 at 70 ep, and Sc-7/8 vs Sc-5/6 at 100 ep). Second, moving from batch 16 → 64 typically maintains or slightly improves accuracy while reducing training time (fewer optimiser steps), e.g., at 70 ep / 1e-4: 0.33 with 13.1 s (Sc-4) versus 0.33 with 24.1 s (Sc-3). Importantly, feature-extraction time (FX) is nearly constant across schedules (≈11.8–12.2 s), so schedule selection mainly trades accuracy for training wall-clock time rather than extraction latency. Classification latency is already sub-second (≈0.29–0.31 s; Sc-1 outlier 0.41 s).

Table 10.

QCNN schedule ablation—epochs × learning rate × batch size: test accuracy with wall-clock costs (FX/Train/Cls) under a fixed quantum feature-extraction pipeline.

Scenario (Sc) Epochs Learning rate Batch size Acc FX time (s) Training time (s) Cls time (s)
1 70 0.001 16 0.3131 11.52 27.57 0.41
2 70 0.001 64 0.3367 11.91 14.27 0.29
3 70 0.0001 16 0.33 11.96 24.1 0.31
4 70 0.0001 64 0.33 12.12 13.1 0.29
5 100 0.001 16 0.3232 12.22 31.79 0.3
6 100 0.001 64 0.3401 11.86 16.36 0.29
7 100 0.0001 16 0.3502 11.81 32.18 0.31
8 100 0.0001 64 0.3569 11.91 16.7 0.3

Statistical evidence from per-sample predictions

Two-sided exact McNemar tests (Holm–Bonferroni corrected across all pairs) confirm that cross-family comparisons are generally significant (Table 11). As before, ViT-B/16 differs from most models; QSVM differs from the classical hybrids and QkNN; and with QCNN now included, QCNN also shows significant differences against the top-accuracy models (ViT-B/16, EfficientNetV2-S, QSVM) and typically differs from QkNN as well—indicating a distinct error profile rather than mere scaling of accuracy. Within-family comparisons (e.g., the two classical CNN + SVM baselines) remain comparatively closer/mixed.

Table 11.

Pairwise McNemar Tests (Two-Sided Exact) on Per-Sample Correctness; Holm–Bonferroni Adjusted p-Values.

EfficientNetV2-S QCNN QKNN QSVM ResNet50-SVM VGGFace-SVM ViT-B/16
EfficientNetV2-S 125 5 70,312 625 10 125
QCNN 125 375 10 15,625 3125 3906
QKNN 5 375 21,875 125 375 3125
QSVM 70,312 10 21,875 7812 15,625 1953
ResNet50 SVM 625 15,625 125 7812 10 5
VGGFace SVM 10 3125 375 15,625 10 25
ViT-B/16 125 3906 3125 1953 5 25

Bias-corrected and accelerated bootstrap intervals for Top-1 and Macro-F1 (Table 12) again separate the top group (ViT-B/16, EfficientNetV2-S, QSVM) from QkNN and the classical baselines. With QCNN added, its CIs sit below the top group and usually overlap with (or fall below) QkNN, consistent with its accuracy tier in the main results. This anchors the trade-off we report: QCNN maintains very low extraction and classification latencies, but its central estimates and CIs place it outside the accuracy band of the modern/quantum-kernel leaders.

Table 12.

Bias-Corrected and Accelerated (BCa) 95% CIs of Accuracy and Macro-F1 on the Test Set.

Model n Acc Acc 95% BCa CI Macro-F1 Macro-F1 95% BCa CI
EfficientNetV2-S 11 0.6364 [0.2727, 0.8182] 0.5303 [0.5273, 0.6364]
QCNN 11 0.1818 [0.0000, 0.3636] 0.1061 [0.0364, 0.1818]
QKNN 11 0.4545 [0.0909, 0.6364] 0.3697 [0.3333, 0.4545]
QSVM 11 0.0909 [0.0000, 0.2727] 0.0909 [0.0000, 0.0909]
ResNet50 SVM 11 0.8182 [0.3636, 0.9091] 0.7576 [0.8182, 0.8182]
VGGFace SVM 11 0.7273 [0.2727, 0.8182] 0.6364 [0.6364, 0.7273]
ViT-B/16 11 1.0 [1.0000, 1.0000] 1.0 [0.4545, 0.8182]

Cliff’s δ on per-sample correctness (Table 13) shows positive δ for ViT-B/16 against all other models (small → large, pair-dependent), and positive δ for QSVM versus the classical hybrids and QkNN. With QCNN included, δ is typically negative vs ViT-B/16 / EfficientNetV2-S / QSVM, and small-to-moderate negative vs QkNN, quantifying the practical magnitude (not just significance) of QCNN’s accuracy gap. These δ patterns complement the CIs and reinforce that differences are not only in how often models err but also in where they err.

Table 13.

Cliff’s Delta on Per-Sample Correctness for Head-to-Head Model Comparisons.

Pair Cliff’s δ (correctness)
EfficientNetV2-S vs QCNN 0.4545
EfficientNetV2-S vs QKNN 0.1818
EfficientNetV2-S vs QSVM 0.5455
EfficientNetV2-S vs ResNet50-SVM  − 0.1818
EfficientNetV2-S vs VGGFace-SVM  − 0.0909
EfficientNetV2-S vs ViT-B/16  − 0.3636
QCNN vs QKNN  − 0.2727
QCNN vs QSVM 0.0909
QCNN vs ResNet50-SVM  − 0.6364
QCNN vs VGGFace-SVM  − 0.5455
QCNN vs ViT-B/16  − 0.8182
QKNN vs QSVM 0.3636
QKNN vs ResNet50-SVM  − 0.3636
QKNN vs VGGFace-SVM  − 0.2727
QKNN vs ViT-B/16  − 0.5455
QSVM vs ResNet50-SVM  − 0.7273
QSVM vs VGGFace-SVM  − 0.6364
QSVM vs ViT-B/16  − 0.9091
ResNet50-SVM vs VGGFace-SVM 0.0909
ResNet50-SVM vs ViT-B/16  − 0.1818
VGGFace-SVM vs ViT-B/16  − 0.2727

Across McNemar (Table 11), BCa CIs (Table 12), and Cliff’s δ (Table 13), the inclusion of QCNN preserves the overall picture: ViT-B/16 remains the absolute-accuracy leader; QSVM retains a statistically supported advantage over classical hybrids while staying compute-efficient; QkNN trades accuracy for the best latency profile; and QCNN occupies the lowest-accuracy yet lowest-latency corner, with an error topology that is statistically distinct from the top group. These statistics align with the Pareto frontier (Fig. 3) and the multi-panel confusion matrices (Fig. 5), especially on the recurrent compound families (fear–surprise and sadness–disgust/anger).

Here we distill the main comparative findings under our unified, compute-accounted protocol on RAF-DB compounds. Rather than a single winner, the results trace an accuracy–efficiency frontier in which models occupy distinct operating points along accuracy, feature-extraction (FX) cost, and latency. The highlights below position each family (Transformer, modern CNN, and quantum hybrids) on that frontier and connect them to the observed error topology. This framing enables deployment-oriented choices without revisiting the full tables and plots.

(1) ViT-B/16 leads in absolute accuracy (63.13%). (2) Among hybrids, QSVM provides the best accuracy-per-compute (54.97% with FX ~ 61.6 s), placing it at the frontier of efficiency–accuracy. (3) QKNN is the most deploy-friendly option under tight latency envelopes, maintaining very low FX (~ 24.47 s), albeit at 36.02% accuracy. (4) EfficientNetV2-S is a strong modern CNN baseline (60.9%), but feature-extraction costs can be heavy (~ 2056.92 s) depending on the setup. All systems share structured error patterns in the fear–surprise and sadness–disgust families (see Fig. 4), indicating the need for AU-aware local attention, margin-shaping losses, and fairness-oriented augmentation to separate overlapping compounds.

Discussion

Across the seven pipelines—ResNet50-SVM, VGGFace-SVM, EfficientNetV2-S, ViT-B/16, QCNN, QKNN, and QSVM—the evidence traces an accuracy–efficiency frontier rather than a single dominant model. ViT-B/16 achieves the highest accuracy (63.13%) with very low feature-extraction cost (FX ≈ 32.84 s) but requires longer training. EfficientNetV2-S reaches a competitive ceiling (60.9%) with shorter training yet carries a heavy FX budget at its best setting (≈ 2056.92 s). Among quantum hybrids, QSVM leads in accuracy (54.97%) at moderate FX (≈ 61.58 s), while QKNN is the latency outlier (FX ≈ 24.47 s) at a lower accuracy (36.02%). QCNN is the most economical on FX (≈ 11.91 s) but accuracy-limited (35.69%). Classical hybrids serve as interpretable anchors (ResNet50-SVM 43.09%; VGGFace-SVM 41%) but fall behind the modern and quantum variants as compound-category overlap increases.

These contrasts crystallise into three recurring trade-offs. First, modern baselines are reliable but resource-distinct: ViT trades longer training for global context and top accuracy, whereas EfficientNetV2-S trains quickly but can become FX-heavy at its strongest configuration. Second, quantum hybrids shift the frontier: QSVM’s quantum kernels enlarge margins on overlapping compounds at moderate compute, and QKNN amortises similarity search for very fast extraction/inference at the cost of top-line accuracy. Third, classical CNN + SVM stacks remain valuable as transparent references but slow down on extraction and show less robustness when action-unit overlaps grow. Ablation trends reinforce these points: for EfficientNetV2-S, the top-5 scenarios cluster around batch sizes 32–64 and learning rates in {1e-4, 1e-3}, with a time-critical configuration (30/16/1e-3) providing a pragmatic but slightly less accurate alternative; for ViT, smaller learning rates and moderate epochs stabilise validation and curb early plateaus.
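To make the second trade-off concrete, the sketch below illustrates the generic recipe behind kernel-based quantum hybrids: embed classical features into a quantum state, estimate pairwise state overlaps (fidelities), and feed the resulting Gram matrix to a classical SVM with a precomputed kernel. The embedding (AngleEmbedding), qubit count, simulator, and function names shown here are illustrative assumptions, not the exact configuration of our QSVM pipeline.

import numpy as np
import pennylane as qml
from sklearn.svm import SVC

n_qubits = 4  # illustrative; must cover the (reduced) feature dimension
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def overlap_circuit(x1, x2):
    # Encode x1, then apply the inverse encoding of x2; the probability of
    # measuring |0...0> equals the fidelity between the two encoded states.
    qml.AngleEmbedding(x1, wires=range(n_qubits))
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    return overlap_circuit(x1, x2)[0]

def kernel_matrix(A, B):
    return np.array([[quantum_kernel(a, b) for b in B] for a in A])

# Illustrative usage with a precomputed-kernel SVM (C = 10 as in the text):
# K_train = kernel_matrix(X_train, X_train)
# clf = SVC(kernel="precomputed", C=10).fit(K_train, y_train)
# y_pred = clf.predict(kernel_matrix(X_test, X_train))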

Error structure is consistent across families and concentrates along two compound axes—fear–surprise and sadness–disgust/anger. Misclassifications such as Fearfully Surprised ↔ Happily/Angrily Surprised and Sadly Angry ↔ Sadly Disgusted point to shared AU patterns (widened eye aperture, brow tension, nasolabial changes) that compress inter-class margins. ViT’s global context reduces some cross-valence leakage on high-arousal, visually distinct categories, yet it still confuses low-arousal blends; EfficientNetV2-S is notably robust on Sadly Disgusted and Happily Surprised, suggesting good capture of mid-scale texture cues, but it inherits ambiguity on disgust- and fear-laden mixes. QSVM tightens boundaries in anger/disgust mixtures, consistent with the hypothesis that quantum feature maps can expand usable margins in specific submanifolds, although fear–surprise remains challenging across models (see the full CM panels).

To avoid over-interpreting small gaps, we complement point estimates with per-sample statistical evidence: paired McNemar tests (two-sided exact, family-wise controlled), BCa 95% bootstrap intervals for Top-1 and macro-F1, and Cliff’s δ effect sizes on correctness. These tests adjudicate whether observed differences are both statistically reliable and practically meaningful, and they align with the Pareto and confusion-matrix views: the top group (ViT-B/16, EfficientNetV2-S, QSVM) is consistently separated from QKNN and classical baselines, with QCNN occupying a distinct, low-FX/low-accuracy corner.
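As a reproducibility aid, the sketch below shows one way to obtain the exact two-sided McNemar p-value and a BCa bootstrap interval for Top-1 accuracy from per-sample correctness vectors, using statsmodels and SciPy; the function names, seed, and resample count are illustrative choices, and family-wise correction (e.g., Holm) would be applied across all pairwise p-values afterwards.

import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_exact_pvalue(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-sample correctness."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    # 2x2 agreement/disagreement table between the two models
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=True).pvalue

def bca_accuracy_ci(correct, level=0.95, n_resamples=10_000, seed=0):
    """BCa bootstrap interval for Top-1 accuracy from a correctness vector."""
    res = stats.bootstrap(
        (np.asarray(correct, dtype=float),),
        np.mean,
        confidence_level=level,
        n_resamples=n_resamples,
        method="BCa",
        random_state=seed,
    )
    return res.confidence_interval.low, res.confidence_interval.high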

From a deployment perspective, the results translate into simple decision rules. For edge/on-device use (FX ≤ 30 s, Cls ≤ 1 s), QKNN is the most deployment-friendly; when accuracy demands exceed what QKNN can deliver, a lean or partially frozen EfficientNetV2-S at 224 px and batch 16–32 is a viable alternative. For low-latency servers (30 < FX ≤ 60 s), QSVM with RBF (C = 10) provides the best accuracy-per-compute, while QKNN remains attractive when memory or cold-start costs dominate. For balanced regimes (60 < FX ≤ 1200 s), QSVM offers stable performance, and top-5 EfficientNetV2-S configurations can be selected when higher ceilings are needed, provided FX is monitored. In accuracy-first/offline scenarios (FX > 1200 s acceptable), ViT-B/16 is preferred (low LR, early stopping), with EfficientNetV2-S as an alternative if FX controls (lower resolution/partial unfreeze) are applied. In all cases, we recommend AU-aware augmentation targeting the two confusion families and post-hoc calibration (temperature/Platt scaling) with confidence thresholds to gate low-certainty outputs.
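These decision rules can be encoded as a simple budget-driven lookup; the helper below is an illustrative sketch whose thresholds and recommendations merely restate the rules above, not a shipped utility.

def recommend_pipeline(fx_budget_s: float, accuracy_first: bool = False) -> str:
    """Map a feature-extraction (FX) budget in seconds to a recommended pipeline,
    following the deployment rules discussed above (illustrative only)."""
    if accuracy_first or fx_budget_s > 1200:
        # Accuracy-first / offline: ViT-B/16 with low LR and early stopping
        return "ViT-B/16 (EfficientNetV2-S with FX controls as fallback)"
    if fx_budget_s <= 30:
        # Edge / on-device: QKNN, or a lean EfficientNetV2-S if more accuracy is needed
        return "QKNN (lean EfficientNetV2-S @ 224 px, batch 16-32, as fallback)"
    if fx_budget_s <= 60:
        # Low-latency server: best accuracy-per-compute
        return "QSVM with RBF kernel (C = 10)"
    # Balanced regime (60 s < FX <= 1200 s)
    return "QSVM (top-5 EfficientNetV2-S configurations when a higher ceiling is needed)"

print(recommend_pipeline(45))  # -> QSVM with RBF kernel (C = 10)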

The principal limitations arise from dataset scope and class balance. Our analysis centres on the RAF-DB compound subset; external validity should be probed on AffectNet, FERPlus/FER2013, and ExpW with cross-dataset transfer tests. Low-arousal categories remain unstable, indicating the value of targeted augmentation (GAN-based synthesis, AU-guided morphs) and temporal supervision (short clips). Fairness and robustness require explicit treatment: performance should be reported by demographic proxies (e.g., gender, age, skin tone where available), with any subgroup gap exceeding 5 percentage points in macro-F1 or accuracy flagged; resilience to lighting, occlusion, and pose should be quantified via stress curves, with a < 3 pp drop per perturbation step enforced as a release threshold; and calibration (ECE, Brier score) should be measured, with post-hoc calibration and confidence gating deployed. On the quantum side, a comprehensive depth-and-noise sweep across all quantum variants is outside the present experimental scope. Instead, a focused depth comparison is reported for HQKNN (depth 3 vs 4) under a fixed shot budget, and the discussion contextualises why shallow circuits are a pragmatic operating point for dataset-scale feasibility. Broader noise-aware studies for QSVM and QCNN are left as future work within the same protocol.
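The calibration and fairness checks above can be operationalised with short routines such as the sketch below; the binning scheme, function names, and thresholds simply mirror the numbers stated in the text and are not a definitive audit procedure.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Expected Calibration Error (ECE) with equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight the gap by bin occupancy
    return ece

def subgroup_gap_exceeded(per_group_accuracy, threshold_pp=5.0):
    """Flag a fairness concern if the max-min accuracy gap across subgroups
    exceeds the stated threshold (in percentage points)."""
    scores = np.asarray(list(per_group_accuracy.values()), dtype=float)
    return (scores.max() - scores.min()) * 100.0 > threshold_pp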

Taken together, these findings position each family at a distinct operating point on the accuracy–efficiency frontier and convert that map into actionable guidance for practitioners. The unified protocol, ablations, and statistical analysis provide a reproducible basis for future comparisons, while the fairness, robustness, and quantum-design recommendations chart concrete next steps toward scalable, low-latency compound FER.

Conclusions

Under a unified, compute-accounted protocol on RAF-DB compounds, ViT-B/16 delivers the highest accuracy; EfficientNetV2-S is the strongest CNN baseline, though FX-heavy at its peak; quantum hybrids shift the efficiency frontier with QSVM leading in accuracy at moderate FX, QKNN minimising extraction latency, and QCNN offering the lightest FX but lower accuracy. Stable confusions along fear–surprise and sadness–disgust highlight the need for AU-aware local cues, margin-shaping objectives (e.g., tuned quantum kernels/contrastive losses), calibration, and fairness-aware augmentation. These findings translate into clear deployment rules (edge/low-latency/balanced/accuracy-first) and set an agenda for cross-dataset validation and calibrated reporting.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1. (37.7KB, docx)

Acknowledgements

We gratefully acknowledge Yogyakarta State University and UPN Veteran Yogyakarta for guidance and access to the computing resources that enabled this work. We also thank the Qiskit and PennyLane open-source communities for tools and documentation that supported our quantum-hybrid experiments. Finally, we appreciate the helpful comments from colleagues in facial-expression recognition and quantum machine learning, which sharpened the comparative design and strengthened the methodological transparency of the study.

Abbreviations

FER: Facial Expression Recognition
RAF-DB: Real-world Affective Faces Database
AU: (Facial) Action Unit
CNN: Convolutional Neural Network
ViT: Vision Transformer
ViT-B/16: Vision Transformer, Base model with 16 × 16 patch size
SVM: Support Vector Machine
QSVM: Quantum Support Vector Machine
QKNN: Quantum k-Nearest Neighbour
QCNN: Quantum Convolutional Neural Network
kNN: k-Nearest Neighbour
FX: Feature-extraction time
Cls: Classification time (per sample)
Acc: Accuracy
Val: Validation accuracy
LR: Learning rate
BS: Batch size
FT: Fine-tuning (unfreeze setting)
CM: Confusion Matrix
RBF: Radial Basis Function (SVM kernel)
HOG: Histogram of Oriented Gradients
BCa: Bias-Corrected and Accelerated (bootstrap interval)
CI: Confidence Interval
ECE: Expected Calibration Error
δ: Cliff’s delta (effect size)
EMA: Exponential Moving Average
RGB: Red–Green–Blue (color space)
CNOT: Controlled-NOT quantum gate
RX: Rotation-about-X quantum gate
ZZ: Pauli-Z ⊗ Pauli-Z entangling interaction
QML: Quantum Machine Learning
GPU: Graphics Processing Unit
CPU: Central Processing Unit
CUDA: Compute Unified Device Architecture
cuDNN: CUDA Deep Neural Network library
pp: Percentage points

Author contributions

Conceptualization, M.Y.F. and H.D.S.; methodology, M.Y.F., H.D.S., and H.J.; software, M.Y.F.; validation, H.D.S. and H.J.; formal analysis, M.Y.F. and H.J.; investigation, M.Y.F.; resources, H.D.S.; data curation, M.Y.F.; writing—original draft preparation, M.Y.F.; writing—review and editing, M.Y.F., H.D.S., and H.J.; visualization, M.Y.F.; supervision, H.D.S. and H.J.; project administration, M.Y.F.; funding acquisition, not applicable.

Funding

This research received no external funding.

Data availability

The datasets analysed in this study are publicly available. RAF-DB (including the compound subset used in our experiments) can be obtained from the dataset authors at http://www.whdeng.cn/RAF/model1.html (accessed 19 Nov 2025).

Competing interests

The authors declare no conflicts of interest.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Atlam, H. F., Shafik, M., Kurugollu, F. & Elkelany, Y. Emotions in mental healthcare and psychological interventions: Towards an inventive emotions recognition framework using AI. Adv. Manuf. Technol. 10.3233/ATDE220609 (2022).
2. Patel, S. E. S. et al. Social cognition training improves recognition of distinct facial emotions and decreases misattribution errors in healthy individuals. Front. Psychiatry 10.3389/fpsyt.2022.1026418 (2022).
3. Rawal, N. & Stock-Homburg, R. Facial emotion expressions in human-robot interaction: A survey. Int. J. Soc. Robot. 10.1007/s12369-022-00867-0 (2022).
4. Alrowais, F. et al. Modified earthworm optimization with deep learning assisted emotion recognition for human computer interface. IEEE Access 10.1109/access.2023.3264260 (2023).
5. AlEisa, H. N. et al. Henry gas solubility optimization with deep learning based facial emotion recognition for human computer interface. IEEE Access 10.1109/access.2023.3284457 (2023).
6. Languré, A. de L. & Zareei, M. Evaluating the effect of emotion models on the generalizability of text emotion detection systems. IEEE Access 12, 70489–70500. 10.1109/ACCESS.2024.3401203 (2024).
7. Russell, J. A. A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178. 10.1037/h0077714 (1980).
8. Scherer, K. R. The dynamic architecture of emotion: Evidence for the component process model. Cogn. Emot. 23, 1307–1351. 10.1080/02699930902928969 (2009).
9. Deramgozin, M. M., Jovanović, S., Arevalillo-Herráez, M., Ramzan, N. & Rabah, H. Attention-enabled lightweight neural network architecture for detection of action unit activation. IEEE Access 10.1109/access.2023.3325034 (2023).
10. Souza, J. M. S. et al. Facial biosignals time–series dataset (FBioT): A visual–temporal facial expression recognition (VT-FER) approach. Electronics 13, 4867. 10.3390/electronics13244867 (2024).
11. Zhang, Y. et al. A new framework combining local-region division and feature selection for micro-expressions recognition. IEEE Access 8, 94499–94509. 10.1109/ACCESS.2020.2995629 (2020).
12. Li, S., Deng, W. & Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2584–2593 (IEEE, 2017). 10.1109/CVPR.2017.277.
13. Tan, M. & Le, Q. V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (2020).
14. Tan, M. & Le, Q. V. EfficientNetV2: Smaller Models and Faster Training (2021).
15. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2021).
16. Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212. 10.1038/s41586-019-0980-2 (2019).
17. Cong, I., Choi, S. & Lukin, M. D. Quantum convolutional neural networks. Nat. Phys. 15, 1273–1278. 10.1038/s41567-019-0648-8 (2019).
18. Schuld, M., Bergholm, V., Gogolin, C., Izaac, J. & Killoran, N. Evaluating analytic gradients on quantum hardware. Phys. Rev. A 99, 032331. 10.1103/PhysRevA.99.032331 (2019).
19. Feng, C., Zhao, B., Zhou, X., Ding, X. & Shan, Z. An enhanced quantum K-nearest neighbor classification algorithm based on polar distance. Entropy 25, 127. 10.3390/e25010127 (2023).
20. Pereira, R. et al. Systematic review of emotion detection with computer vision and deep learning. Sensors 24, 3484. 10.3390/s24113484 (2024).
21. So, J. & Han, Y. Facial landmark-driven keypoint feature extraction for robust facial expression recognition. Sensors 25, 3762. 10.3390/s25123762 (2025).
22. Qian, C., Lobo Marques, J. A., de Alexandria, A. R. & Fong, S. J. Application of multiple deep learning architectures for emotion classification based on facial expressions. Sensors 25, 1478. 10.3390/s25051478 (2025).
23. Mukhiddinov, M., Djuraev, O., Akhmedov, F., Mukhamadiyev, A. & Cho, J. Masked face emotion recognition based on facial landmarks and deep learning approaches for visually impaired people. Sensors 23, 1080. 10.3390/s23031080 (2023).
24. Talala, S., Shvimmer, S., Simhon, R., Gilead, M. & Yitzhaky, Y. Emotion classification based on pulsatile images extracted from short facial videos via deep learning. Sensors 24, 2620. 10.3390/s24082620 (2024).
25. Kwon, S. et al. Analytical framework for facial expression on game experience test. IEEE Access 10, 104486–104497. 10.1109/ACCESS.2022.3210712 (2022).
26. Ghosh, A., Umer, S., Dhara, B. C. & Ali, G. G. Md. N. A multimodal pain sentiment analysis system using ensembled deep learning approaches for IoT-enabled healthcare framework. Sensors 25, 1223. 10.3390/s25041223 (2025).
27. Leone, A., Caroppo, A., Manni, A. & Siciliano, P. Vision-based road rage detection framework in automotive safety applications. Sensors 21, 2942. 10.3390/s21092942 (2021).
28. Su, W., Zhang, H., Su, Y. & Yu, J. Facial expression recognition with confidence guided refined horizontal pyramid network. IEEE Access 10.1109/access.2021.3069468 (2021).
29. Porta-Lorenzo, M., Vázquez-Enríquez, M., Pérez-Pérez, A., Alba-Castro, J. L. & Docío-Fernández, L. Facial motion analysis beyond emotional expressions. Sensors 22, 3839. 10.3390/s22103839 (2022).
30. Peng, C. et al. An innovative neighbor attention mechanism based on coordinates for the recognition of facial expressions. Sensors 24, 7404. 10.3390/s24227404 (2024).
31. Aguileta, A. A., Brena, R. F., Molino-Minero-Re, E. & Galván-Tejada, C. E. Facial expression recognition from multi-perspective visual inputs and soft voting. Sensors 22, 4206. 10.3390/s22114206 (2022).
32. El Zarif, N., Montazeri, L., Leduc-Primeau, F. & Sawan, M. Mobile-optimized facial expression recognition techniques. IEEE Access 9, 101172–101185. 10.1109/ACCESS.2021.3095844 (2021).
33. Chen, Z., Yan, L., Wang, H. & Adamyk, B. Improved facial expression recognition algorithm based on local feature enhancement and global information association. Electronics 13, 2813. 10.3390/electronics13142813 (2024).
