Abstract
Artificial intelligence–driven educational systems have largely prioritised cognitive adaptation, often neglecting the critical role of learners’ emotional states in shaping engagement and learning outcomes. To address this limitation, this study proposes a multimodal, emotion-aware deep learning framework designed to integrate emotional intelligence into intelligent learning environments. The framework jointly analyses facial expressions, speech characteristics, and textual responses to infer learners’ emotional states and models the interdependencies among these modalities through a graph-based fusion mechanism. The proposed approach is evaluated using benchmark emotion datasets, namely AffectNet and IEMOCAP, to assess its capability to recognise emotional patterns and support adaptive feedback during learning interactions. Experimental results demonstrate that incorporating emotional awareness leads to substantial improvements in learner engagement, emotional regulation, and task persistence when compared with conventional cognition-focused systems. The framework achieves consistently high emotion recognition performance, particularly for positive and neutral affective states, and shows robust generalisation across different emotion categories. User study outcomes further suggest that learners perceive the system as more supportive and responsive due to its emotional adaptability. In addition to performance evaluation, the study discusses key ethical considerations associated with emotion-aware educational technologies, including data privacy, informed consent, and responsible deployment. Overall, the findings underscore the potential of multimodal emotional intelligence to advance the development of more empathetic, adaptive, and effective artificial intelligence-based educational systems.
Keywords: Deep learning, Emotional intelligence, Personalised learning, Multimodal data, AI education, Facial expression recognition, Speech sentiment analysis
Subject terms: Mathematics and computing, Psychology
Introduction
Artificial intelligence has rapidly transformed the educational landscape by enabling adaptive learning, personalised instruction, and automated assessment mechanisms. Most AI-driven educational systems, however, primarily emphasise cognitive performance while overlooking the emotional factors that critically influence learner engagement, motivation, and persistence1,2. Educational research has long established that emotions such as curiosity, frustration, anxiety, and satisfaction play a decisive role in shaping attention, memory, and problem-solving abilities during learning activities. Consequently, the absence of emotional intelligence in intelligent tutoring systems limits their effectiveness and responsiveness3.
Emotional intelligence in education refers to the ability to recognise, interpret, and respond appropriately to learners’ emotional states. Positive emotions tend to facilitate deeper engagement and knowledge retention, whereas negative emotions can obstruct learning progress4. Integrating emotional intelligence into AI-driven educational platforms is therefore essential for developing adaptive systems that respond not only to learners’ cognitive needs but also to their affective conditions, enabling more empathetic and supportive learning environments5.
Current methods and challenges
Recent advances in affective computing have enabled AI systems to recognise emotions through multiple modalities, including facial expressions, speech signals, and textual interactions. Facial expression recognition techniques analyse visual cues to infer emotional states, speech-based approaches capture paralinguistic features such as tone and pitch, and text sentiment analysis extracts affective information from written responses. More recently, multimodal emotion recognition systems have emerged, combining information from multiple sources to obtain a holistic representation of learners’ emotional states6.
Despite these advances, several limitations persist. Many existing systems rely on single-modality analysis or employ simplistic fusion strategies that fail to capture complex interdependencies among emotional cues. Furthermore, emotion recognition models often struggle with generalisation across diverse learners and dynamic educational contexts. These limitations highlight the need for integrated frameworks capable of effectively modelling multimodal emotional information while supporting real-time adaptability in learning environments7.
Motivation of the work
The motivation for this work arises from the growing need for AI-driven educational systems that are not only cognitively adaptive but also emotionally aware. While current intelligent learning platforms can personalise content based on performance metrics, they often neglect the emotional variability that significantly affects learning outcomes. Addressing learners’ emotional states can help reduce frustration, sustain engagement, and promote task persistence8.
This study is driven by the premise that incorporating emotional intelligence into AI-based education can lead to more personalised, empathetic, and effective learning experiences. By integrating emotional cues from multiple modalities and modelling their interactions, the proposed framework aims to support adaptive feedback mechanisms that align instructional strategies with learners’ affective states9,10.
Key contributions
The main contributions of this work are summarised as follows:
A multimodal emotion-aware AI framework that integrates facial, speech, and textual emotional cues to support adaptive educational interactions11.
An effective fusion strategy that models relationships among multimodal emotional signals to enhance robustness and interpretability12.
A comprehensive experimental evaluation demonstrating improved engagement, emotional regulation, and task completion compared to conventional cognitive-focused systems.
An analysis of ethical considerations related to emotion-aware educational technologies, including privacy protection and responsible deployment.
Organisation of the article
The remainder of this article is organised as follows. The "Literature review" section reviews related work on emotion-aware AI and educational technologies. The "Materials and methods" section describes the proposed framework, datasets, and methodological details. The "Experimental results and discussion" section presents experimental results and performance evaluation. Finally, the "Conclusion and future directions" section concludes the paper and outlines future research directions.
Literature review
Recent advances in artificial intelligence have significantly influenced educational technologies, particularly in the areas of personalised learning, adaptive feedback, and learner engagement. Contemporary research increasingly recognises that effective learning systems must account not only for cognitive performance but also for learners’ emotional states, as emotions directly influence attention, motivation, and persistence. As a result, emotion-aware AI systems have emerged as a key research direction in intelligent education.
Several recent studies have explored the integration of emotional intelligence into AI-driven educational platforms. Singh et al.1 highlighted how emotion-aware AI can bridge the gap between cognitive learning and personalised instruction by enabling systems to adapt dynamically to learners’ affective conditions. Similarly, Roumpas et al.2 proposed an ethical, cognitive-aware framework for multimodal adaptive learning systems, emphasising the importance of aligning emotional awareness with responsible AI practices. These works underline the growing consensus that emotional intelligence is a foundational component of next-generation educational AI.
Multimodal emotion recognition has gained particular attention due to its ability to capture complementary emotional cues from different sources. Islam et al.3 proposed a deep learning–based multimodal fusion approach that integrates facial expressions and textual sentiment, demonstrating improved emotion recognition performance. Salloum et al.13 further showed that real-time emotion recognition can be used to adapt teaching strategies dynamically, leading to enhanced learner engagement. Recent systematic reviews confirm that multimodal affective computing outperforms unimodal approaches in educational contexts by providing a more holistic understanding of learner emotions14.
Advances in representation learning have further strengthened emotion recognition capabilities. Liu et al.15 introduced a pose-aware contrastive facial representation learning framework that improves robustness under diverse visual conditions, which is particularly relevant for unconstrained educational environments. Complementing visual approaches, Zhang et al.16 proposed an EEG-based emotion recognition model capable of detecting subtle emotional variations, highlighting the potential of physiological signals, although practical classroom deployment remains limited. These studies demonstrate the increasing sophistication of emotion recognition techniques used in recent AI systems.
Graph-based and relational modelling approaches have recently emerged as powerful tools for multimodal emotion fusion. Meng et al.17 proposed a heterogeneous graph-based multi-message passing framework for conversational emotion recognition, showing that graph neural networks can effectively model complex relationships among emotional cues. Such approaches are particularly relevant for educational systems that must integrate facial, speech, and textual information while preserving contextual dependencies. Related studies on multimodal analytics pipelines further support the use of structured fusion strategies to enhance interpretability and robustness in learning environments18,19.
In parallel, AI-driven personalised learning frameworks have evolved to integrate emotional intelligence alongside cognitive modelling. Recent studies demonstrate that emotionally adaptive learning systems can improve engagement, reduce frustration, and support sustained task completion10,12,20. Privacy-preserving techniques such as federated learning have also been explored to address ethical concerns associated with emotional data usage, enabling personalised adaptation while safeguarding sensitive learner information8. These developments reflect the increasing importance of ethical and trustworthy AI in education.
Despite these advances, existing research reveals several limitations. Many systems still rely on limited fusion mechanisms or evaluate emotion recognition independently of educational outcomes. Moreover, few studies provide unified frameworks that combine multimodal emotion recognition, relational modelling, and real-world educational evaluation. Addressing these gaps, the present work builds upon recent advances in multimodal affective computing, graph-based fusion, and emotion-aware learning to propose an integrated framework that supports adaptive, empathetic, and ethically responsible AI-driven education. Table 1 presents a comparative analysis of existing research.
Table 1.
Summary of recent studies highlighting advances and limitations in emotion-aware AI-based educational systems.
| References | Focus | Technology used | Proposed model features | Outcome/impact | AI model type | Educational context |
|---|---|---|---|---|---|---|
| Singh et al.1 | Integration of cognitive and emotional intelligence in education | Transformer-based models, emotion recognition | Multimodal emotional intelligence for adaptive and personalised learning | Improved learner engagement and cognitive–emotional learning outcomes | Transformers, GNNs | Personalised and emotion-aware learning |
| Salloum et al.13 | Emotion recognition for adaptive teaching | Emotion recognition, adaptive AI feedback | Real-time emotional and cognitive state adaptation | Enhanced student satisfaction and engagement | Emotion recognition, Adaptive AI | Emotion-aware teaching systems |
| Soman et al.9 | Empathetic AI agents in learning and mental health | Reinforcement learning, empathy modelling | Emotion-responsive AI agents for learner support | Reduced anxiety and improved learner engagement | Reinforcement learning, Empathetic AI | Mental health support in education |
| Lateef11 | Wearable AI for emotional engagement | Wearable devices, machine learning | Continuous emotional monitoring and feedback | Improved emotional engagement and learning effectiveness | Wearable AI, ML | Emotion-aware educational wearables |
| Islam et al.3 | Multimodal emotion recognition | Deep learning, facial and text sentiment analysis | Multimodal fusion for real-time emotion detection | Enhanced emotion recognition accuracy with educational applicability | Deep learning, Multimodal | Healthcare analytics with educational relevance |
| Zhou et al.8 | Privacy-preserving personalised learning | Federated learning, multimodal data | Secure emotional and cognitive data fusion | Improved personalisation with enhanced privacy | Federated learning, Multimodal | Privacy-aware personalised education |
| Khediri et al.21 | Real-time multimodal intelligent tutoring | Multimodal emotion recognition, real-time feedback | Emotion-aware intelligent tutoring system | Increased engagement and learner satisfaction | Multimodal AI, Real-time systems | Adaptive intelligent tutoring |
| Vistorte et al.14 | Systematic review of emotion-aware AI in education | AI-based emotion recognition | Emotion-driven adaptive learning strategies | Higher engagement, reduced frustration, improved completion rates | Emotion-aware AI | Emotion-adaptive learning environments |
| Sajja et al.22 | AI-enabled intelligent learning assistants | Deep learning, adaptive AI | Cognitive and emotional adaptation in learning assistants | Improved learning effectiveness through emotional awareness | Deep learning, AI Assistants | Personalised adaptive learning |
| Chetry23 | Emotion detection in learning environments | Emotion detection algorithms | Real-time emotion-aware feedback | Increased learner engagement | Emotion detection, AI | Adaptive learning systems |
Materials and methods
The key materials and methods related to the research are as follows.
Dataset description
This section gives a detailed overview of the datasets used for emotion recognition in AI-powered educational systems. It specifically includes the AffectNet dataset for facial expression recognition and the IEMOCAP dataset for speech emotion recognition, which are used to train the deep learning models in this study21,24,25.
AffectNet dataset
AffectNet is a large-scale facial expression dataset created specifically for facial expression recognition. It contains around 0.4 million facial images annotated with discrete expression labels and continuous valence–arousal values. The dataset covers 8 emotion classes: neutral, happy, angry, sad, fear, surprise, disgust, and contempt. In addition, each image carries continuous annotations of valence (pleasant vs. unpleasant) and arousal (activity level, from low to high). These annotations support both categorical (emotion labels) and dimensional (valence, arousal) analysis of emotions, making the dataset appropriate for many emotion recognition applications in intelligent systems24.
Nevertheless, it should be noted that, although AffectNet is a valuable resource for facial expression recognition (FER), it was not collected from educational environments. The emotional expressions in the dataset may therefore not cover the full range of emotions that students experience during educational tasks, and the facial expressions it contains may differ from the subtle, context-dependent emotions that students show in a real classroom. In addition, the cultural bias of the sample, which is drawn mainly from Western populations, can limit generalisability to diverse student groups. To address these shortcomings, collecting domain-specific emotion data from real classrooms is a natural future extension. Such data would capture multimodal student emotional expressions during interaction with AI-based educational tools, supporting better modelling of the emotions elicited in academic settings.
IEMOCAP dataset
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is a well-known multimodal corpus for emotion recognition. It consists of 302 recorded videos of spoken dialogues collected over 5 recording sessions between speaker pairs (10 speakers; 5 male and 5 female). The conversations were designed to elicit emotions, with participants instructed to express different emotional states. The emotion categories in the corpus are: Angry, Excited, Fear, Sad, Surprised, Frustrated, Happy, Disappointed, and Neutral. In addition to the categorical labels, the dataset provides valence, arousal, and dominance ratings, supporting both categorical and dimensional analysis.
Because IEMOCAP captures multimodal expressions of emotion through speech, facial expressions (video), and text transcriptions, it is well suited for training an emotion-aware AI system, particularly for multimodal emotion recognition tasks in which emotions must be inferred from speech, facial expression, and written language. However, the IEMOCAP dataset was not recorded in an educational context and may not cover the full range of emotions that students express in a learning setting. Its dialogues are largely pre-scripted, so the expression of emotion is not necessarily as spontaneous or situation-driven as it would be in a classroom. Moreover, the adult speaker population of IEMOCAP limits its applicability to student populations, who may express emotions differently from adults. To make the framework more generalisable to educational scenarios, we intend to build a new domain-specific dataset that captures student–teacher interactions in real classrooms, enabling more precise and contextually grounded recognition of emotions in student–teacher and peer-to-peer exchanges in school environments. Table 2 presents the dataset details for AffectNet and IEMOCAP.
Table 2.
Dataset summary for AffectNet and IEMOCAP.
| Attribute | AffectNet dataset | IEMOCAP dataset |
|---|---|---|
| Dataset size | ~ 0.4 million facial images | 302 videos (151 sessions, 2 speakers per session) |
| Number of classes (emotions) | 8 classes: Neutral, Happy, Angry, Sad, Fear, Surprise, Disgust, Contempt | 9 classes: Angry, Excited, Fear, Sad, Surprised, Frustrated, Happy, Disappointed, Neutral |
| Data types (modalities) | Facial expressions only | Speech (audio), Video (facial expressions), Text (transcriptions) |
| Valence and arousal labels | Yes, continuous valence and arousal ratings | Yes, continuous valence, arousal, and dominance ratings |
| Emotion labelling type | Categorical (8 emotions), Dimensional (Valence, Arousal) | Categorical (9 emotions), Dimensional (Valence, Arousal, Dominance) |
| Speaker demographics | N/A | 10 speakers (5 male, 5 female) |
| Recording sessions | N/A | 5 sessions |
| Strengths | Large-scale dataset for facial emotion recognition | Multimodal dataset for emotion recognition across speech, video, and text |
| Limitations | Posed facial expressions, cultural bias | Adult speakers, scripted emotional dialogues |
Data pre-processing
Data preprocessing is a pivotal part of building machine learning models, particularly for emotion recognition tasks, where data quality and uniformity directly affect performance. This section describes the preprocessing applied to AffectNet and IEMOCAP. For IEMOCAP, preprocessing is divided into audio, video, and text handling; for AffectNet, the focus is on facial images. Preprocessing posed challenges such as mitigating the class imbalance that is common in emotion recognition datasets, and maintaining data integrity while standardising format and scale. The adopted approaches aimed to retain as much information as possible while cleaning, but not oversimplifying, the datasets for model training14,23.
Preprocessing of AffectNet dataset
The AffectNet dataset consists of facial images annotated with discrete emotional labels. To ensure data consistency and suitability for deep learning–based facial emotion recognition, a structured preprocessing pipeline was designed. This pipeline aimed to standardise facial representations, enhance data quality, and improve model generalisation. The main preprocessing steps are summarised as follows11,12,26.
- Facial detection, landmark extraction, and alignment
Face detection was performed using state-of-the-art detectors, including MTCNN and Dlib, to accurately localise facial regions within each image. Following detection, facial landmark extraction was applied to identify key reference points such as the eyes, nose, and mouth. These landmarks were used to align facial images to a canonical orientation, thereby reducing pose variability and ensuring spatial consistency across samples. This alignment process enables the model to focus on discriminative facial features relevant to emotion recognition rather than extraneous background or pose-related variations20,27.
- Image normalisation
Image normalisation was conducted using min–max scaling, where pixel intensity values were rescaled to the range [0, 1] by dividing each pixel value by the maximum possible intensity value (255). This approach standardises the input distribution, improves numerical stability, and facilitates faster convergence during model training. Min–max normalisation was selected over z-score normalisation because it preserves relative pixel intensity relationships, which is particularly suitable for convolutional and transformer-based vision models and ensures compatibility with pre-trained model initialisation. Furthermore, this method mitigates the effects of illumination and colour variations across images23,28.
- Data augmentation
To enhance robustness and reduce overfitting, data augmentation techniques were applied to the training images. These transformations included random rotation, horizontal flipping, scaling, and zooming. By simulating real-world variations such as head pose changes and facial movement, data augmentation improves the model’s ability to generalise to unseen facial expressions under diverse conditions3,13.
- Handling class imbalance
The AffectNet dataset exhibits notable class imbalance, with emotions such as happy and neutral being overrepresented, while others, including contempt and disgust, are underrepresented29,30. To address this issue and ensure balanced learning, multiple strategies were employed:
- Over-sampling: Minority classes were augmented using synthetic data generation techniques such as the synthetic minority over-sampling technique (SMOTE).
- Under-sampling: Samples from majority classes were selectively reduced to minimise bias towards dominant emotion categories.
- Class weighting: During model training, higher class weights were assigned to underrepresented emotion classes to encourage the model to learn discriminative features across all categories more effectively25.
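To make these steps concrete, the following minimal sketch illustrates face detection with MTCNN, min–max normalisation, Keras-style augmentation, and class weighting. It is an illustrative outline rather than the exact pipeline used in the study: landmark-based alignment is omitted for brevity, the `mtcnn`, `opencv-python`, `tensorflow`, and `scikit-learn` packages are assumed, and the label array is a placeholder.

```python
# Illustrative sketch of the AffectNet-style facial preprocessing (not the exact study pipeline).
import cv2
import numpy as np
from mtcnn import MTCNN
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

detector = MTCNN()

def preprocess_face(image_bgr, size=224):
    """Detect the most confident face, crop, resize, and min-max normalise to [0, 1]."""
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(image_rgb)
    if not faces:
        return None
    x, y, w, h = max(faces, key=lambda f: f["confidence"])["box"]
    crop = image_rgb[max(y, 0):y + h, max(x, 0):x + w]
    crop = cv2.resize(crop, (size, size))
    return crop.astype("float32") / 255.0          # min-max scaling

# Augmentation roughly matching the transformations described above.
augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True,
                               zoom_range=0.1, width_shift_range=0.05,
                               height_shift_range=0.05)

# Class weighting for imbalanced emotion labels (the label array is a placeholder).
labels = np.array([0, 0, 0, 1, 2, 2, 3])
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
class_weights = dict(zip(np.unique(labels), weights))
```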
Preprocessing of IEMOCAP dataset
The IEMOCAP dataset comprises multimodal data, including speech audio, video-based facial expressions, and textual transcriptions, annotated with categorical and dimensional emotion labels. To ensure consistency and suitability for deep learning-based multimodal emotion recognition, a structured pre-processing pipeline was designed. The pipeline aimed to standardise modality-specific inputs, reduce noise, and preserve temporal and emotional information31. The main preprocessing steps are outlined below.
- Audio pre-processing
Speech signals were resampled to a uniform sampling rate and subjected to noise reduction to minimise background interference. Pre-emphasis filtering was applied to enhance high-frequency components, followed by framing and windowing to segment the signal into short-time frames. Mel-Frequency Cepstral Coefficients (MFCCs) and their first- and second-order derivatives were extracted to capture spectral and temporal characteristics relevant to emotional expression32,33. Feature normalisation was then applied to stabilise training and improve convergence of the speech emotion recognition model.
- Video and facial pre-processing
From the video recordings, facial frames were extracted at a fixed frame rate. Face detection and landmark extraction were performed using MTCNN and Dlib to localise and align facial regions. The detected faces were cropped, aligned to a canonical orientation, and normalised to ensure spatial consistency across frames. This process reduces variations caused by head pose, scale, and illumination, allowing the model to focus on discriminative facial emotion features18,31.
- Text pre-processing
Textual transcriptions associated with each utterance were cleaned to remove punctuation, non-linguistic symbols, and transcription artefacts20. The cleaned text was tokenised and encoded using a BERT-based tokeniser, enabling contextual embedding generation that captures both semantic meaning and emotional nuance. Padding and truncation were applied to maintain uniform sequence lengths for efficient batch processing34,35.
- Temporal alignment and synchronisation
To support effective multimodal fusion, audio, video, and text modalities were temporally aligned at the utterance level. This alignment ensures that emotional cues extracted from different modalities correspond to the same temporal context, enabling coherent multimodal learning and graph-based fusion36.
- Handling class imbalance
The IEMOCAP dataset exhibits imbalance across emotion categories. To mitigate this issue, class weighting was applied during training to emphasise underrepresented emotions. Additionally, data augmentation techniques such as time stretching and pitch shifting were applied to speech samples belonging to minority classes, improving class balance while preserving emotional characteristics37,38.
Through these preprocessing steps, the IEMOCAP dataset was transformed into a clean, temporally aligned, and balanced multimodal dataset suitable for robust speech-, facial-, and text-based emotion recognition in emotion-aware educational AI systems.
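As an illustration of the audio branch described above, the following minimal sketch extracts MFCCs with first- and second-order deltas, applies pre-emphasis and per-utterance normalisation, and shows the time-stretching and pitch-shifting augmentations used for minority classes. It assumes the `librosa` package; file paths and parameter values are placeholders rather than the study's exact settings.

```python
# Minimal sketch of the IEMOCAP audio preprocessing described above (illustrative only).
import librosa
import numpy as np

def extract_mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """Resample, pre-emphasise, and extract MFCCs with delta and delta-delta features."""
    y, sr = librosa.load(wav_path, sr=sr)            # resample to a uniform rate
    y = librosa.effects.preemphasis(y, coef=0.97)    # boost high-frequency components
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.vstack([mfcc, delta, delta2])
    # Per-utterance feature normalisation to stabilise training.
    return (features - features.mean(axis=1, keepdims=True)) / \
           (features.std(axis=1, keepdims=True) + 1e-8)

def augment_minority_sample(y, sr=16000):
    """Time stretching and pitch shifting used to balance under-represented emotions."""
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return stretched, shifted
```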
Proposed model architecture
The proposed model is a layered, multimodal deep learning framework designed to embed emotional intelligence into AI-driven educational systems by jointly analysing facial expressions, speech signals, and textual interactions. As illustrated in Fig. 1, the framework begins with multimodal data acquisition, where facial images (AffectNet) and speech–text data (IEMOCAP) are processed through modality-specific preprocessing pipelines. Facial inputs undergo face detection, landmark alignment, min–max normalisation, and augmentation, while speech signals are denoised and transformed into temporal acoustic representations. Textual inputs are cleaned, tokenised, and encoded using a BERT-based contextual embedding strategy. Each modality is then processed by a dedicated deep learning backbone—Vision Transformers for facial emotion representation, Temporal Convolutional Networks (TCNs) for speech emotion modelling, and BERT-based encoders for text sentiment extraction, ensuring effective capture of modality-specific emotional patterns1,2,5.
Fig. 1.
The proposed model architecture.
To enable holistic emotion inference, the extracted features are integrated through a Graph Neural Network (GNN) that explicitly models cross-modal dependencies and relational interactions among emotional cues. This graph-based fusion mechanism facilitates robust multimodal representation learning by leveraging message passing across modalities, thereby enhancing interpretability and resilience to noisy or incomplete inputs. The fused representation is subsequently passed through fully connected layers, followed by a SoftMax classifier to predict learners’ emotional states, which are then used to drive adaptive, emotion-aware learning feedback. The complete functional workflow of the proposed framework, including data splitting, preprocessing, feature extraction, fusion, and classification, is formally detailed in Algorithm 1. Together, the architecture and algorithm establish a unified, scalable, and ethically aligned solution for emotion-aware educational AI systems.
Algorithm 1.
The proposed model architecture.
Vision transformers for facial expression recognition
While emotion-aware AI educational systems leverage advanced technologies to personalise learning, they require nuanced emotion recognition. Vision Transformers discern subtle facial cues to gauge changing affective states39,40. A comprehensive understanding, however, requires fusing several modalities: speech tonality, written language, and fleeting micro-expressions jointly reveal the student's inner experience and reactions over the course of a learning session. By perceiving the multidimensional nature of emotion through multisource input, the system can sensitively adapt instruction to enhance outcomes18,30. Developing such systems also carries responsibility: learners' well-being, growth, and privacy must be safeguarded alongside technological progress.
Mathematical framework for ViTs in facial expression recognition
The proposed system employs Vision Transformers for the analysis of facial images and the detection of emotional states19,26,30. This section outlines the process in detail:
- Input image: The input to the ViT model consists of a facial image with dimensions H × W × C, where H represents height, W denotes width, and C indicates the number of channels (e.g., RGB). A facial image is captured via a webcam or camera during student interactions with the AI system.
- Patch embedding: In Vision Transformers, the image is divided into non-overlapping patches. The image I, with dimensions H × W × C, is segmented into patches of size P × P. For instance, when the image measures 224 × 224 pixels, it is segmented into patches of 16 × 16 pixels. Each patch $x_i$ is flattened into a one-dimensional vector and subsequently projected into a higher-dimensional space to generate the patch embeddings, as presented by Eq. (1):

$$e_i = E\,x_i, \qquad i = 1, \ldots, N \tag{1}$$

where $N = HW/P^{2}$ is the number of patches, E is the learned projection matrix, and each patch $x_i$ is embedded into a vector of dimension D.
- Positional encoding: Positional encodings are added to each patch embedding to preserve spatial information, because transformers do not naturally account for the spatial positions of patches. The patch embeddings $e_i$ are supplemented with the positional encoding $PE_i$ as presented by Eq. (2):

$$z_i = e_i + PE_i \tag{2}$$

Thanks to positional encoding, the model is better able to comprehend the relative locations of facial features (eyes, nose, mouth, etc.).
- Transformer encoder layers: After positional information is added, the patch embeddings are sent through several transformer encoder layers. Each transformer layer is composed of multi-head self-attention and a feedforward network40,41. The attention scores A, which indicate how much each patch should "attend" to other patches in the image, are calculated by the self-attention mechanism as presented by Eq. (3):

$$A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \tag{3}$$

where the query, key, and value matrices are denoted by Q, K, and V, respectively, and the key vector's dimension is represented by $d_k$; the attention output is then obtained by applying A to the value matrix V. This mechanism enables the model to learn relationships between different facial regions (for example, the mouth and eyes) while also capturing global features like smiles, frowns, and raised brows.
- Feature extraction and output: A feature vector $f$, which reflects the emotion expressed by the facial expression, is the model's final output after it has gone through the transformer encoder layers15,22. To predict the facial emotion (such as happy, sad, angry, etc.), this feature vector is subsequently passed through a classification head, which is typically a SoftMax layer (Eq. 4):

$$\hat{y} = \operatorname{softmax}(W f + b) \tag{4}$$

where $\hat{y}$ is the predicted emotional state based on the facial expression, W is the weight matrix, and b is the bias term.
- Example of Vision Transformers in the proposed system: Facial expression recognition is essential in the Emotion-Aware AI Educational System for determining whether a student is frustrated or engaged34,35. For instance:
- Happy expression: If the student's facial expression is classified as happy, the system infers that the student is likely engaged with the content and proceeds to deliver increasingly challenging material.
- Frustrated expression: If the facial expression is classified as frustrated, the system recognises possible emotional distress and may intervene, for example by simplifying the material or providing supportive feedback.
Thus, the ViT model’s predictions of facial emotions are incorporated into the system’s overall emotion-aware feedback loop, which adjusts learning content in real time in response to the student’s emotional state.
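To make Eqs. (1)–(4) concrete, the simplified PyTorch sketch below shows patch embedding, learned positional encodings, transformer encoding, and a SoftMax head. It is a toy illustration under stated assumptions: mean pooling replaces the usual [CLS] token, the class name `TinyViT` is hypothetical, and all dimensions are illustrative rather than the configuration used in this work.

```python
# Simplified ViT sketch for Eqs. (1)-(4); dimensions and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3,
                 dim=256, depth=4, heads=8, num_emotions=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        self.embed = nn.Linear(channels * patch_size ** 2, dim)        # Eq. (1): e_i = E x_i
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))      # Eq. (2): z_i = e_i + PE_i
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # self-attention, Eq. (3)
        self.head = nn.Linear(dim, num_emotions)                       # Eq. (4): softmax(W f + b)

    def forward(self, images):
        b, c, h, w = images.shape
        p = self.patch_size
        # Split the image into non-overlapping P x P patches and flatten each patch.
        patches = images.unfold(2, p, p).unfold(3, p, p)               # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.embed(patches) + self.pos
        features = self.encoder(tokens).mean(dim=1)                    # pooled feature vector f
        return torch.softmax(self.head(features), dim=-1)              # emotion probabilities

probs = TinyViT()(torch.randn(2, 3, 224, 224))                         # shape: (2, 8)
```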
Role of ViTs in facial expression recognition
Facial expression recognition is essential for detecting the emotional states of students during their interactions with educational AI systems. A student's facial cues, including the movement of the mouth, eyebrows, and eyes, provide valuable insight into whether they are feeling joyful, bewildered, frustrated, focused, or bored. In the proposed system, Vision Transformers are deployed to analyse images of the student's face and classify their emotional expression5,19. Vision Transformers are selected because of their ability to capture long-range dependencies across the entire image, which is crucial for interpreting the delicate and intricate facial expressions that signal different emotions16. This ability to model global relationships is particularly valuable for reading the nuanced nonverbal cues that reveal how a student truly feels while engaging with an AI teaching platform (Fig. 2).
Fig. 2.

Vision transformer model working for facial expression recognition in the proposed system.
BERT-based models for text sentiment analysis
The Emotion-Aware AI Educational System utilises BERT-based models for text sentiment analysis, enabling the assessment of the emotional tone in student-provided text, including feedback, responses, or queries. Text sentiment analysis enables the system to assess student emotions such as engagement, frustration, happiness, or confusion from their text input, which is essential for providing personalised and emotionally attuned feedback18–20,28,32.
Mathematical framework of BERT-based models for text sentiment analysis
The BERT model, or Bidirectional Encoder Representations from Transformers, represents a leading approach in transformer-based natural language understanding. BERT effectively identifies the emotional tone present in a specific text during sentiment analysis. The model's capacity to comprehend contextual relationships among words in a sentence renders it especially effective for sentiment analysis tasks within the Emotion-Aware AI Educational System, where grasping the emotional content of a student's text, such as feedback, questions, or responses, is crucial. BERT utilises the transformer architecture, employing self-attention mechanisms and feed-forward neural networks to process text in parallel rather than sequentially23,24. This section outlines the mathematical framework that supports the BERT-based model for sentiment analysis.
- Tokenisation and input representation

Input: A sentence S made up of M words or tokens (Eq. 5):

$$S = \{w_1, w_2, \ldots, w_M\} \tag{5}$$

- A tokenizer (such as WordPiece) splits the input into sub-word or word tokens.
- BERT uses special tokens:
- [CLS]: Added at the start of the sequence for classification tasks.
- [SEP]: Used at the end of a sentence or to separate multiple sentences.
For example, the tokenised input for “I love this course!” would be:
Input sequence: [CLS] I love this course [SEP].
Each token $w_i$ is represented as a vector of size D (the embedding dimension). Embedding layer: each token is embedded into a fixed-dimension vector of size D, computed as the sum of three parts:

- Token embedding $E_{\text{tok}}(w_i)$: a learned embedding for each token.
- Positional embedding $E_{\text{pos}}(i)$: a learned embedding that retains the position of each token in the sequence.
- Segment embedding $E_{\text{seg}}(i)$: used to distinguish between sentences.

The input embedding for a token $w_i$ is computed as (Eq. 6):

$$E_i = E_{\text{tok}}(w_i) + E_{\text{pos}}(i) + E_{\text{seg}}(i) \tag{6}$$
The input embeddings for the entire sequence S are then passed to the BERT model.
- Transformer encoder layer: BERT processes input embeddings through transformer encoder layers. A transformer encoder layer includes two key operations:
- Self-attention: The attention mechanism enables the model to assess the importance of each token in relation to the others while taking into account the entire context of the sentence.
- Feed-forward network: Following self-attention, the output is fed into a feed-forward neural network for additional processing.
The self-attention mechanism uses the query, key, and value vectors obtained from the input embeddings to calculate attention scores. Equation (7) gives the attention score $\alpha_{ij}$ between tokens $w_i$ and $w_j$:

$$\alpha_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_{n=1}^{N} \exp(q_i \cdot k_n)} \tag{7}$$

where $q_i$ is the query vector of token $w_i$, $k_j$ is the key vector of token $w_j$, and N is the number of tokens in the input sequence.
The attention mechanism then produces a weighted sum of the value vectors, as given by Eq. (8):

$$z_i = \sum_{j=1}^{N} \alpha_{ij}\, v_j \tag{8}$$

where $v_j$ is the value vector of token $w_j$. Following the self-attention step, the output is passed through a feed-forward neural network with activation functions (such as ReLU) and then normalised.
Feature extraction

The final [CLS] token serves as the sentence's representation in sentiment analysis. After the transformer layers have been applied, the [CLS] token's output embedding is extracted and used as the classification feature vector (Eq. 9):

$$f = h^{(L)}_{[\text{CLS}]} \tag{9}$$

where $h^{(L)}_{[\text{CLS}]}$ denotes the output embedding of the [CLS] token after the final (L-th) transformer layer. Alternatively, feature extraction can involve pooling across all token embeddings, but the [CLS] token embedding is commonly used for classification tasks.
Sentiment classification

The final feature vector of the [CLS] token, $f$, is used to predict sentiment through a classification layer. The sentiment prediction is made using a SoftMax function, as presented in Eq. (10):

$$\hat{y} = \operatorname{softmax}(W_s f + b_s) \tag{10}$$

where $W_s$ is the weight matrix of the classifier, $b_s$ is the bias term, and $\hat{y}$ is the predicted sentiment class (e.g., Positive, Negative, Neutral).
The BERT model, which stands for Bidirectional Encoder Representations from Transformers, is a transformer-based architecture that effectively comprehends contextual relationships among words within a sentence. BERT undergoes pre-training on extensive text data and is subsequently fine-tuned for specific tasks, such as sentiment analysis, to identify sentiments (positive, negative, neutral) within text (Fig. 3).
Fig. 3.

BERT architecture with its sub-layers.
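The hedged sketch below shows how Eqs. (5)–(10) map onto the Hugging Face Transformers API: the tokenizer adds the [CLS] and [SEP] tokens, the [CLS] embedding from the final layer is taken as the sentence feature, and an untrained, illustrative linear layer with SoftMax produces a sentiment distribution. The checkpoint name and the three-class label set are assumptions, not the study's exact configuration.

```python
# Hedged sketch of Eqs. (5)-(10) using the Hugging Face Transformers library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 3)   # Positive / Negative / Neutral

text = "I love this course!"
inputs = tokenizer(text, return_tensors="pt",
                   padding=True, truncation=True, max_length=64)   # adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = bert(**inputs)

# Eq. (9): take the [CLS] token embedding from the final layer as the sentence feature.
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Eq. (10): SoftMax over the classification layer (untrained here, so outputs are arbitrary).
probs = torch.softmax(classifier(cls_embedding), dim=-1)
print(probs.shape)   # torch.Size([1, 3]) -> distribution over the three sentiment classes
```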
Temporal convolutional networks for speech emotion recognition
TCNs are used to recognise speech emotions in the Emotion-Aware AI Educational System. TCNs analyse audio data to identify emotions based on the speech’s tone, pitch, speed, and other acoustic characteristics, much like ViTs do for facial expression recognition. Speech is essential for determining a student’s emotional state, which is necessary for dynamically modifying the educational process21,24,25,33,41. When modelling sequential data, like time-series audio signals, where the temporal dependencies in the speech data are crucial for emotion detection, TCNs are especially well-suited.
Architecture of TCN for speech emotion recognition
The TCN for speech emotion recognition has the following components (Fig. 4):
Input layer: The input is an audio signal, which can be a raw waveform or a feature-extracted representation (such as MFCCs or spectrograms)2,10.
- Convolutional layer (with dilation):
- Dilated convolutions are used to capture temporal dependencies at various time scales.
- The dilation factor broadens the receptive field of the convolutions, allowing the network to capture longer-term dependencies without increasing the number of layers.
- The convolution kernel is applied to the temporal dimension of an audio signal.
Causal convolutions: These ensure that the convolution uses only information from previous time steps, which is important for speech because the current state should depend only on past and present information.
Residual connections: These connections aid in training deeper networks and prevent vanishing/exploding gradient issues.
Fully connected layer: After passing through the convolutional layers, the output is routed through one or more fully connected layers to determine the emotional label.
Output layer: The final layer is a SoftMax classifier, which predicts the emotional state using the TCN's extracted features7.
Fig. 4.

Architecture of TCN for speech emotion recognition.
Role of TCNs in speech emotion recognition
TCNs are a subset of CNNs that are more effective and resilient than conventional RNNs at handling temporal patterns and sequential data. By concentrating on the speech’s temporal structure, TCNs are utilised to extract significant features from the unprocessed audio signal in the context of speech emotion recognition31,32. The key features of TCNs are as follows4,32.
Causal convolutions: These convolutions make sure that when the network predicts the output, it only uses historical data—not future data.
Dilated convolutions: Without needing many layers, these convolutions enable the network to capture long-range temporal dependencies.
Stable training: Compared to conventional RNNs and LSTMs, TCNs are simpler to train and converge more quickly.
The TCN model can be applied to speech emotion recognition by analysing features like:
Pitch (the perceived frequency of vocalisation)
Timbre (the quality of vocal sound)
Speech rate (the velocity of speech delivery)
Intensity (the loudness of vocalisation)
Through the analysis of raw audio signals or feature-extracted data (e.g., MFCCs), TCNs can discern patterns associated with various emotional states such as happiness, sadness, anger, and neutrality.
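As a minimal illustration of these ideas, the sketch below stacks dilated causal convolution blocks with residual connections over MFCC-style input and ends in a SoftMax classifier. Channel sizes, the dilation schedule, the class name `TinyTCN`, and the number of emotion classes are illustrative assumptions rather than the trained configuration.

```python
# Minimal TCN-style sketch: dilated causal convolutions with residual connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        # Left-padding keeps the convolution causal: no future time steps are used.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.relu(self.conv(F.pad(x, (self.pad, 0))))
        return out + x                                 # residual connection

class TinyTCN(nn.Module):
    def __init__(self, n_features=39, channels=64, num_emotions=6):
        super().__init__()
        self.input_proj = nn.Conv1d(n_features, channels, kernel_size=1)
        # Exponentially growing dilations widen the receptive field without extra depth.
        self.blocks = nn.Sequential(*[CausalConvBlock(channels, dilation=2 ** i)
                                      for i in range(4)])
        self.head = nn.Linear(channels, num_emotions)

    def forward(self, features):                       # e.g. MFCC + deltas: (batch, 39, time)
        h = self.blocks(self.input_proj(features))
        return torch.softmax(self.head(h.mean(dim=-1)), dim=-1)

probs = TinyTCN()(torch.randn(2, 39, 300))             # (2, 6) emotion probabilities
```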
Graph neural networks for multimodal fusion
In the Emotion-Aware AI Educational System, GNNs facilitate multimodal fusion to synthesise and evaluate data from various modalities, including facial expressions, speech, and text. Multimodal fusion seeks to integrate various data types to develop a holistic comprehension of the student’s emotional condition. A GNN is a neural network explicitly engineered for processing graph-structured data. It can be utilised to elucidate the relationships and interdependencies among various modalities in multimodal emotion recognition systems9,26. Similar to how Vision Transformers facilitate facial expression recognition and temporal convolutional networks are utilised for speech emotion recognition, GNNs are adept at integrating various data types, specifically, distinct modalities such as facial expression, speech, and text into a cohesive representation that considers the interrelations among these modalities6.
Mathematical model of GNN for multimodal fusion
The following steps represent the Graph Neural Network's mathematical model12,19,42:
- Graph construction

Let the graph be represented as $G = (V, E)$, where $V$ is the set of nodes (representing modalities, e.g., facial expression, speech, and text) and $E$ is the set of edges (representing the relationships between modalities). Each node $v_i \in V$ has a feature vector $h_i$, which represents the modality-specific features (e.g., features from facial expressions, speech, or text).
- Message passing (node update)

Every node updates its features during message passing by collecting information from its neighbours. The updated feature $h_i^{(t+1)}$ for node $v_i$ at iteration $t+1$ is computed as presented by Eq. (11):

$$h_i^{(t+1)} = \text{UpdateFunction}\!\left(h_i^{(t)}, \{h_j^{(t)} : v_j \in \mathcal{N}(i)\}\right) \tag{11}$$

where $h_i^{(t)}$ is the feature vector of node $v_i$ at iteration $t$ and $\{h_j^{(t)} : v_j \in \mathcal{N}(i)\}$ represents the set of features of the neighbours of node $v_i$. The UpdateFunction is typically composed of an aggregation function (such as sum, mean, or max) followed by a neural network layer (for example, a fully connected layer).
- Aggregation of multimodal information

Upon completion of T iterations of message passing, the final feature vector for each node is obtained. The node feature $h_i^{(T)}$ encapsulates the multimodal information for modality i. The final multimodal representation $h_G$ is derived by consolidating the features of all nodes within the graph (Eq. 12):

$$h_G = \text{AggregateFunction}\!\left(\{h_i^{(T)} : v_i \in V\}\right) \tag{12}$$

where the final feature vectors of all nodes in the graph are combined by the AggregateFunction, which may be a straightforward concatenation or a pooling operation such as mean pooling or max pooling4.
- Emotion classification

To classify the emotional state, the final multimodal representation $h_G$ is passed through a fully connected layer (or a group of layers) and then a SoftMax activation (Eq. 13):

$$\hat{y} = \operatorname{softmax}(W h_G + b) \tag{13}$$

where W is the weight matrix of the classifier, b is the bias term, and $\hat{y}$ is the predicted emotion (e.g., Happy, Sad, Angry).
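A minimal PyTorch sketch of Eqs. (11)–(13) is given below: three modality nodes (face, speech, text) exchange mean-aggregated messages over a fully connected graph, the node features are mean-pooled into a graph representation, and a SoftMax layer predicts the emotion. The class name `ModalityFusionGNN`, the dimensions, the number of iterations, and the update network are illustrative assumptions, not the trained configuration.

```python
# Minimal sketch of graph-based multimodal fusion (Eqs. 11-13); sizes are illustrative.
import torch
import torch.nn as nn

class ModalityFusionGNN(nn.Module):
    def __init__(self, dim=256, num_emotions=6, iterations=2):
        super().__init__()
        self.iterations = iterations
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # UpdateFunction, Eq. (11)
        self.classifier = nn.Linear(dim, num_emotions)                   # classifier, Eq. (13)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, dim); adjacency: (num_nodes, num_nodes) with zero diagonal.
        h = node_feats
        for _ in range(self.iterations):
            # Mean-aggregate neighbour features, then update each node (Eq. 11).
            messages = adjacency @ h / adjacency.sum(dim=1, keepdim=True)
            h = self.update(torch.cat([h, messages], dim=-1))
        h_graph = h.mean(dim=0)                                          # AggregateFunction, Eq. (12)
        return torch.softmax(self.classifier(h_graph), dim=-1)           # Eq. (13)

# Face, speech, and text nodes connected to each other (no self-loops).
features = torch.randn(3, 256)
adjacency = torch.ones(3, 3) - torch.eye(3)
emotion_probs = ModalityFusionGNN()(features, adjacency)                 # distribution over 6 emotions
```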
GNN architecture for multimodal fusion
A graph neural network used for multimodal fusion typically has the following components (Fig. 5):
Input representation: Every modality (facial expression, speech, and text) is represented by a node in a graph. The nodes’ attributes are based on the features of these modalities, such as emotion-specific features extracted from facial expressions, speech prosody, and sentiment from text42.
- Graph construction: A graph is constructed with the following properties10:
- Nodes represent modalities (such as facial expression, speech, and text).
- Edges represent the dependencies or relationships between modalities (for example, how facial expressions relate to speech tone or text sentiment).
- Message passing:
- Each node collects information from its neighbours (i.e., from different modalities). This allows the GNN to learn how each modality influences the others through emotional cues.
- For example, information about the emotional tone in the speech modality can affect the analysis of facial expressions in the video modality.
Feature aggregation: After several iterations of message passing, the node features are aggregated, and the final node representations are computed, capturing the combined information from all modalities.
Multimodal fusion: The aggregated node representations are fed through a fully connected layer or attention mechanism, which combines the data into a single multimodal feature representation.
Emotion classification: The fused multimodal representation is then fed through a classification layer (typically a SoftMax classifier) to predict the student's emotional state (e.g., happy, sad, angry).
Fig. 5.

Architecture of GNNs for multimodal fusion.
Role of GNNs in multimodal fusion
The challenge in multimodal fusion is to combine data from various modalities (such as visual, auditory, and textual inputs) into a cohesive representation while retaining each modality’s unique characteristics. Graph Neural Networks accomplish this by representing the data from each modality as a graph and learning the interactions and dependencies between them18. The key features of GNNs in the proposed model are as follows (Fig. 6).
Graph representation: In a graph, each modality can be shown as a node, and the connections between them can be shown as edges. For example, the connections between facial expressions and speech tone are edges19.
Message passing: GNNs gather information from neighbouring nodes and use a message-passing scheme to keep node representations up to date. The model can then learn how the various modalities affect one another.
Flexibility: GNNs are well suited to integrating multimodal data because of their capacity to handle irregular and complex data structures, which gives them an advantage over models that assume regular, grid-like inputs28.
Fig. 6.
The key role of GNNs in multimodal fusion.
Model training and hyperparameter tuning
The models were trained on their respective datasets using backpropagation and gradient descent. Optimisation was performed with the Adam optimiser, together with a learning rate scheduler that adjusts the learning rate during training to promote good convergence. Training was carried out in batches with a suitable batch size (e.g., 32 or 64), and early stopping was used to avoid overfitting1,33,43. The emotion classification task used a cross-entropy loss function, which is appropriate for multi-class classification problems. For each training iteration, the model's output was compared with the true emotional labels and the loss was calculated as (Eq. 14):

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) \tag{14}$$

where $y_c$ is the ground-truth label for class c, $\hat{y}_c$ is the predicted probability for class c, and C is the number of possible classes (e.g., Happy, Sad, Angry). The models were trained for a predetermined number of epochs (e.g., 50 or 100), and the validation loss was monitored throughout to ensure that the model did not overfit the training data32. Table 3 presents the summary of the hyperparameters used for ViTs, TCNs, and BERT-based models.
Table 3.
Hyperparameter details used for ViTs, TCNs, and BERT-based models.
| Hyperparameter | Vision transformers | Temporal convolutional networks | BERT-based models |
|---|---|---|---|
| Learning rate | 1e−4 | 1e−4 | 1e−5 |
| Batch size | 64 | 64 | 32 |
| Number of layers | 12 | 5 | 12 (BERT-base) |
| Dropout rate | 0.3 | 0.3 | 0.1 |
| Kernel size | N/A | 5 | N/A |
| Number of attention heads | 12 | N/A | 8 |
| Activation function | GELU | ReLU | GELU |
| Regularisation (L2) | 1e−4 | 1e−4 | 1e−4 |
| Optimizer | AdamW | AdamW | AdamW |
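The following hedged sketch outlines a training loop consistent with the procedure above: weighted cross-entropy loss (Eq. 14), the AdamW optimiser with a learning-rate scheduler, and early stopping on the validation loss. The `model`, the data loaders, and `class_weights` (a 1-D tensor of per-class weights) are assumed to exist, and the hyperparameter values are placeholders rather than the exact settings in Table 3.

```python
# Hedged sketch of the training loop; names and hyperparameters are placeholders.
import torch

def train(model, train_loader, val_loader, class_weights, epochs=50, patience=5, lr=1e-4):
    criterion = torch.nn.CrossEntropyLoss(weight=class_weights)      # Eq. (14), class-weighted
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
    best_val, stale = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)                   # model returns raw class logits
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)
        scheduler.step(val_loss)                                      # adapt the learning rate
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:                                     # early stopping
                break
    return model
```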
Model performance measuring parameters
This section outlines the performance metrics and evaluation parameters utilised to assess the effectiveness of emotion recognition models, including Vision Transformers, Temporal Convolutional Networks, and BERT-based models, in the context of the AI-driven educational system. The selected metrics aim to assess model accuracy, robustness, and their overall influence on the learning experience13,14,23,29,43.
Evaluation metrics
- Accuracy: Accuracy quantifies the proportion of correct predictions made by the model relative to the total number of predictions. This serves as a crucial measure of the model's effectiveness in classifying students' emotional states (Eq. 15).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

- Precision: Precision quantifies the ratio of true positive predictions to the total number of positive predictions made by the model. The significance of this increases when the cost of false positives, such as misclassifying emotions, is substantial (Eq. 16).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

- Recall: Recall measures the ratio of true positive predictions to the total number of actual positive instances. Detecting emotions is crucial, as failing to identify them (false negatives) can greatly affect the model's effectiveness (Eq. 17).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

- F1 Score: The F1 score represents the harmonic mean of precision and recall. This metric offers a comprehensive assessment of model performance in scenarios with class imbalance (Eq. 18).

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$

Confusion matrix: The confusion matrix offers a comprehensive analysis of the model's predictions, detailing true positives, false positives, true negatives, and false negatives for each class.
ROC curve and AUC (area under the curve): The ROC curve illustrates the relationship between the true positive rate (recall) and the false positive rate, while the AUC measures the model’s overall capacity to differentiate between classes, with a higher AUC signifying superior performance.
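These metrics can be computed directly with scikit-learn, as in the short illustration below; the label and probability arrays are placeholders, and macro averaging is one reasonable choice for multi-class emotion recognition rather than a setting mandated by the study.

```python
# Illustrative computation of Eqs. (15)-(18), the confusion matrix, and ROC-AUC.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true = np.array([0, 1, 2, 2, 1, 0])                 # ground-truth emotion classes (placeholder)
y_pred = np.array([0, 1, 2, 1, 1, 0])                 # predicted emotion classes (placeholder)
y_score = np.array([[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7],
                    [0.2, 0.5, 0.3], [0.2, 0.6, 0.2], [0.6, 0.3, 0.1]])  # class probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))                        # Eq. (15)
print("Precision:", precision_score(y_true, y_pred, average="macro"))      # Eq. (16)
print("Recall   :", recall_score(y_true, y_pred, average="macro"))         # Eq. (17)
print("F1 score :", f1_score(y_true, y_pred, average="macro"))             # Eq. (18)
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Macro ROC-AUC:", roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))
```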
Batch processing and feedback evaluation
The system delivers feedback derived from batch-processed emotional data in the absence of real-time processing. The subsequent parameters assist in assessing the efficacy of the system’s emotional feedback2–5.
Student engagement: Student engagement denotes the degree of interaction and participation demonstrated by students during the learning process. Metrics such as time spent on tasks, the number of tasks completed, and interactions with the system are used for measurement. Engagement serves as a measure of the effectiveness of emotional feedback in sustaining or enhancing student interest in learning activities.
- Task completion rate: The task completion rate quantifies the ratio of tasks completed by students relative to the total tasks assigned (Eq. 19).

$$\text{Task completion rate} = \frac{\text{Number of tasks completed}}{\text{Total number of tasks assigned}} \times 100\% \tag{19}$$
Experimental results and discussion
In this section, we present the results of experiments performed on the AffectNet and IEMOCAP datasets. We conduct experiments on multimodal emotion recognition with a model comprising ViTs for facial expression recognition, TCNs for speech emotion recognition, and BERT-based representations for text sentiment analysis. Moreover, we compare our results with previously proposed models for emotion recognition tasks, including CNNs for facial expression recognition, RNNs for speech emotion recognition, and BERT-based models for text sentiment analysis. The findings are discussed in relation to the research questions, hypotheses, and performance assessment measures identified above.
Simulation setup and details
To ensure reliable performance and accurate results, the proposed emotion-aware AI system was simulated using both hardware and software tools.
Hardware setup
The experiments used a high-performance GPU (NVIDIA RTX 3090) to speed up model training and inference, especially for deep learning components like ViTs, TCNs, BERT-based models, and GNNs1,2,5,6. The GPU significantly reduced training times and allowed for the efficient processing of large datasets. The system also included 32 GB of RAM and an Intel Core i9 processor to handle computationally intensive tasks like training large models and managing multimodal data streams17,23.
Software setup
The model was built using Python 3.8 and various deep learning frameworks, including the following3:
TensorFlow and Keras were used to build and train models, particularly those based on ViT, TCN, and BERT. Keras provided a simple interface for model development and fine-tuning, while TensorFlow enabled GPU-based distributed training3,4.
PyTorch was also used to build GNNs and perform multimodal fusion. PyTorch's dynamic computation graph is well suited to complex models such as GNNs.
Hugging Face’s Transformers library was used to implement and fine-tune BERT-based models for text sentiment analysis7,8.
In addition, the AffectNet and IEMOCAP datasets were pre-processed and stored locally, with a custom data pipeline created to handle data augmentation, tokenisation, and feature extraction for facial, speech, and text data. For evaluation and testing, performance metrics such as precision, recall, F1 score, and accuracy were calculated using scikit-learn. The entire system was built on a Linux operating system (Ubuntu 20.04) to ensure stability and compatibility with deep learning libraries. To ensure efficient training and evaluation, the model was trained using the Adam optimiser with a learning rate of 0.001 and a batch size of 32.
Real-time performance considerations
Although the adopted model includes computationally expensive components such as TCNs, BERT, and GNNs, the system has been designed with a focus on emotion recognition performance9,10. Real-time performance, however, remains an open issue. Because the present investigation does not involve real-time interaction, future work will focus on optimising the trained model for real-time deployment and on verifying that the resulting system also performs well in online interaction tasks.
Data set splitting
This study carefully partitioned the datasets to support the training, validation, and testing phases used to assess the efficacy of the proposed emotion-aware AI system. Partitioning allows the model to be trained on one data subset, validated on a distinct subset during training, and finally assessed on an unseen test set. The AffectNet and IEMOCAP datasets were partitioned according to standard machine learning protocols to mitigate overfitting and support effective generalisation18,27. We employed the conventional 80–10–10 division: 80% of the data was allocated for model training, 10% was set aside for validation, and the remaining 10% was designated as the test set for the final evaluation. This procedure guarantees that the test set remains entirely unseen during training, facilitating a fair assessment of model efficacy.
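One common way to realise such an 80–10–10 split is with two successive stratified calls to scikit-learn's train_test_split, as sketched below; the feature matrix, labels, and random seed are placeholders, and stratification is an illustrative choice to keep emotion classes balanced across splits.

```python
# Minimal sketch of an 80-10-10 split; X and y stand in for the preprocessed data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 128)              # placeholder feature matrix
y = np.random.randint(0, 8, size=1000)     # placeholder emotion labels (8 classes)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)                   # 80% training
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)    # 10% validation / 10% test
```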
Simulation results
Results on AffectNet dataset (facial expression recognition)
Figures 7 and 8 present a visual comparison of the proposed model with CNN and ResNet baselines on the AffectNet dataset. Figure 7 illustrates Accuracy and Precision, while Fig. 8 reports Recall and F1-score across all emotion categories. As evidenced by both figures and the numerical results in Table 4, the proposed ViTs model consistently outperforms existing approaches across all evaluation metrics, demonstrating improved robustness and balanced performance in facial expression recognition.
Fig. 7.
Comparison graph for proposed versus existing models on AffectNet dataset (accuracy and precision).
Fig. 8.
Comparison graph for proposed versus existing models on AffectNet dataset (recall and F1-score).
Table 4.
Results on AffectNet dataset for facial expression recognition (%).
| Emotion | Metric | Proposed model (ViTs) | Existing model (CNN) | Existing model (ResNet) |
|---|---|---|---|---|
| Happy | Precision | 96.34 | 88.67 | 90.51 |
| | Recall | 95.45 | 87.56 | 88.33 |
| | F1-score | 95.12 | 87.98 | 89.12 |
| | Accuracy | 95.23 | 85.67 | 87.94 |
| Sad | Precision | 94.58 | 84.89 | 85.12 |
| | Recall | 92.77 | 82.55 | 83.09 |
| | F1-score | 93.24 | 83.67 | 84.45 |
| | Accuracy | 94.45 | 80.91 | 82.34 |
| Angry | Precision | 91.34 | 80.23 | 82.19 |
| | Recall | 89.65 | 78.45 | 80.12 |
| | F1-score | 90.12 | 79.22 | 81.01 |
| | Accuracy | 93.56 | 82.45 | 84.11 |
| Surprise | Precision | 94.23 | 86.34 | 89.11 |
| | Recall | 93.47 | 85.21 | 87.65 |
| | F1-score | 94.12 | 85.34 | 88.04 |
| | Accuracy | 94.32 | 83.21 | 86.78 |
| Fear | Precision | 90.56 | 80.22 | 83.14 |
| | Recall | 89.43 | 78.31 | 81.34 |
| | F1-score | 89.68 | 79.12 | 82.05 |
| | Accuracy | 90.34 | 79.76 | 81.92 |
| Disgust | Precision | 91.22 | 81.56 | 83.14 |
| | Recall | 90.21 | 79.43 | 80.25 |
| | F1-score | 90.32 | 80.12 | 81.01 |
| | Accuracy | 91.57 | 80.84 | 82.13 |
| Neutral | Precision | 96.56 | 89.12 | 91.34 |
| | Recall | 95.43 | 87.67 | 89.11 |
| | F1-score | 95.67 | 88.14 | 90.01 |
| | Accuracy | 95.62 | 88.74 | 90.76 |
| Contempt | Precision | 88.12 | 79.35 | 81.11 |
| | Recall | 87.22 | 77.58 | 79.12 |
| | F1-score | 87.31 | 78.22 | 80.14 |
| | Accuracy | 87.42 | 78.03 | 80.31 |
Results on IEMOCAP dataset (speech emotion recognition)
This section assesses the efficacy of the proposed TCN model against RNN and LSTM baselines on the IEMOCAP dataset for speech emotion recognition. The models were evaluated using Precision, Recall, F1-score, and Accuracy across the emotions Anger, Happiness, Sadness, Surprise, Disgust, and Neutral.
Table 5 and Figs. 9 and 10 display the performance metrics (Precision, Recall, F1-score, and Accuracy) for three emotion recognition models, TCNs, RNN, and LSTM, across the emotional states Anger, Happiness, Sadness, Surprise, Disgust, and Neutral. These results underscore the ability of each model to recognise emotions from speech data. The TCN model typically surpasses the RNN and LSTM models, attaining superior Precision, Recall, F1-score, and Accuracy for the majority of emotions; it attained 92.34% precision for Anger and 95.67% precision for Happiness, demonstrating its robust capability to discern emotional states from speech. The RNN model, although effective, exhibits diminished performance, particularly regarding accuracy and recall, achieving 82.11% accuracy for Anger and 87.43% for Happiness. The LSTM model, while superior to the RNN, remains inferior to the TCN in many instances, especially for Anger, where it attained an accuracy of 84.11% compared with the TCN's 93.76%; nonetheless, it performs strongly for Happiness with an accuracy of 91.56%. Overall, the TCN model demonstrates superior performance across nearly all metrics, indicating that Temporal Convolutional Networks are exceptionally adept at capturing long-term temporal dependencies in speech emotion recognition tasks (a generic sketch of the dilated causal convolution block that a TCN stacks is given after Fig. 10).
Table 5.
Results on IEMOCAP dataset for speech emotion recognition (%).
| Emotion | Precision (TCNs) | Recall (TCNs) | F1-Score (TCNs) | Accuracy (TCNs) | Precision (RNN) | Recall (RNN) | F1-Score (RNN) | Accuracy (RNN) | Precision (LSTM) | Recall (LSTM) | F1-score (LSTM) | Accuracy (LSTM) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anger | 92.34 | 90.45 | 91.12 | 93.76 | 84.56 | 82.14 | 83.23 | 82.11 | 86.23 | 84.42 | 85.34 | 84.11 |
| Happiness | 95.67 | 94.58 | 94.89 | 95.44 | 89.21 | 87.32 | 88.13 | 87.43 | 91.14 | 89.44 | 90.21 | 91.56 |
| Sadness | 90.82 | 88.23 | 89.74 | 90.23 | 80.11 | 78.56 | 79.34 | 80.12 | 82.45 | 80.34 | 81.21 | 82.78 |
| Surprise | 92.21 | 91.76 | 92.56 | 93.32 | 84.89 | 82.67 | 83.45 | 82.21 | 86.78 | 84.12 | 85.09 | 84.89 |
| Disgust | 89.33 | 87.25 | 88.46 | 89.99 | 81.67 | 79.43 | 80.55 | 81.21 | 82.89 | 80.12 | 81.67 | 82.34 |
| Neutral | 94.76 | 93.67 | 93.85 | 94.78 | 88.98 | 86.72 | 87.56 | 88.44 | 90.12 | 88.88 | 89.11 | 90.76 |
Fig. 9.
Comparative analysis of simulation results (precision and recall) on the IEMOCAP dataset for speech emotion recognition using the proposed and existing models.
Fig. 10.
Comparative analysis of simulation results (accuracy and F1-score) on the IEMOCAP dataset for speech emotion recognition using the proposed and existing models.
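For context, a TCN stacks dilated causal 1-D convolutions so that its receptive field grows with depth while never looking ahead in time. The block below is a generic PyTorch illustration of that idea; the channel count, kernel size, and dilation schedule are hypothetical and not the configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """One dilated causal convolution with a residual connection,
    the basic unit that a TCN stacks with increasing dilation."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding only: causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                   # x: (batch, channels, time)
        out = F.pad(x, (self.pad, 0))                       # no look-ahead into the future
        out = F.relu(self.conv(out))
        return out + x                                      # residual connection

# Stacking blocks with dilations 1, 2, 4, 8 gives a receptive field of roughly
# 1 + (kernel_size - 1) * (1 + 2 + 4 + 8) time steps.
tcn = nn.Sequential(*[CausalConvBlock(64, 3, d) for d in (1, 2, 4, 8)])
```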
Results on multimodal fusion using GNNs
As the outcomes of multimodal fusion show (Table 6 and Fig. 11), the GNN model demonstrably surpasses the traditional fusion model in all metrics: Precision, Recall, F1-score, and Accuracy. For Happy emotion recognition, the GNN model attained 96.48% Precision, 95.70% Recall, 95.13% F1-score, and 95.26% Accuracy, whereas the traditional fusion model underperformed with 88.47% Precision, 87.29% Recall, 87.51% F1-score, and 86.38% Accuracy. This trend is uniform across all emotions, with the GNN model demonstrating significant enhancements in Precision, Recall, F1-score, and Accuracy, especially in identifying nuanced emotions such as Neutral (96.28% Precision and 95.00% Recall). The GNN model also surpassed the traditional fusion model on complex emotions such as Anger and Fear, attaining 92.29% Precision and 91.03% Recall for Anger, and 90.51% Precision and 89.02% Recall for Fear, whereas the traditional fusion model recorded 80.45% Precision and 78.71% Recall for Anger, and 80.36% Precision and 78.88% Recall for Fear. These findings confirm that GNNs are especially proficient at capturing multimodal emotional cues and subtleties, leading to enhanced recognition accuracy across a wide range of emotions.
Table 6.
Simulation results for multimodal fusion performance (%).
| Emotion | Precision (GNNs) | Recall (GNNs) | F1-Score (GNNs) | Accuracy (GNNs) | Precision (traditional fusion) | Recall (traditional fusion) | F1-score (traditional fusion) | Accuracy (traditional fusion) |
|---|---|---|---|---|---|---|---|---|
| Happy | 96.48 | 95.70 | 95.13 | 95.26 | 88.47 | 87.29 | 87.51 | 86.38 |
| Sad | 94.32 | 92.56 | 93.11 | 94.63 | 84.22 | 83.16 | 83.75 | 84.59 |
| Anger | 92.29 | 91.03 | 91.48 | 93.29 | 80.45 | 78.71 | 79.33 | 80.61 |
| Surprise | 94.42 | 93.72 | 94.09 | 94.37 | 87.14 | 85.52 | 86.21 | 85.94 |
| Fear | 90.51 | 89.02 | 89.64 | 90.76 | 80.36 | 78.88 | 79.52 | 80.43 |
| Disgust | 91.08 | 90.34 | 90.57 | 91.19 | 81.22 | 79.48 | 80.37 | 81.31 |
| Neutral | 96.28 | 95.00 | 95.17 | 96.11 | 89.19 | 87.67 | 88.30 | 88.74 |
Fig. 11.
Comparison of simulation results (accuracy and recall) for multimodal fusion performance.
Extended performance metrics
This section expands the model evaluation by integrating supplementary performance metrics in addition to the conventional precision, recall, F1-score, and accuracy. The extended metrics comprise area under the curve (AUC), receiver operating characteristic (ROC) curve, and confusion matrix. The incorporation of these metrics offers a more thorough comprehension of the model’s efficacy, particularly in distinguishing true positives, false positives, true negatives, and false negatives across diverse emotion categories. By analysing these comprehensive metrics, we can evaluate the models’ proficiency in accurately recognising emotions and reducing misclassifications, especially in complex or nuanced emotional expressions.
Confusion matrix
Confusion matrices for the AffectNet dataset (facial emotion recognition) and the IEMOCAP dataset (speech emotion recognition) were calculated on their respective test sets. AffectNet dataset: The confusion matrix for the AffectNet test set was computed over 40,000 facial images (Fig. 12). The proposed model (ViTs) performed strongly across all emotion classes, with especially high precision and recall for Happy, Neutral, and Sad. The diagonal cells of the confusion matrix represent correct predictions, whereas off-diagonal cells highlight misclassifications between similar emotions, such as Anger versus Sadness.
Fig. 12.
Confusion matrix for proposed model for AffectNet test dataset.
The confusion matrix for the IEMOCAP test set (which includes 16,000 sound frames) demonstrates how well the proposed model classified speech emotions (Fig. 13). The model performed well on emotions such as Happiness and Surprise, but showed some misclassifications between Sadness and Fear. The confusion matrix, scaled by a factor of 25, aids in visualising performance with larger sample counts, highlighting the model's accuracy and potential areas for improvement. These confusion matrices provide insights into the models' performance, revealing both their strengths in recognising specific emotions and the difficulties in distinguishing similar emotional expressions or tones.
Fig. 13.
Confusion matrix for proposed model for IEMOCAP test dataset.
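Confusion matrices such as those in Figs. 12 and 13 can be reproduced for any classifier output with scikit-learn; the snippet below is a generic sketch rather than our exact plotting code, and the label list simply mirrors the AffectNet categories.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

EMOTIONS = ["Happy", "Sad", "Angry", "Surprise", "Fear", "Disgust", "Neutral", "Contempt"]

def plot_confusion(y_true, y_pred, labels=EMOTIONS):
    """Row-normalised confusion matrix: diagonal cells are correct predictions,
    off-diagonal cells show which emotions are confused with one another."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    disp = ConfusionMatrixDisplay(cm, display_labels=labels)
    disp.plot(xticks_rotation=45, values_format=".2f")
    plt.tight_layout()
    plt.show()
```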
AUC ROC curve
The AUC ROC curve illustrates how well the Proposed Model distinguishes between various emotions in both the AffectNet (facial expression recognition) and IEMOCAP (speech emotion recognition) datasets. In the analysis:
AffectNet Dataset for Facial Expression Recognition: The ROC curve (Fig. 14) for the Proposed Model (ViTs) depicts the relationship between the True Positive Rate (TPR) and False Positive Rate (FPR) for emotion classification using facial expressions. High AUC values (close to one) indicate that the model accurately distinguishes emotions such as Happy, Neutral, and Sad. A curve that rises steeply towards the top-left corner, achieving a high TPR at a low FPR, corresponds to fewer misclassifications; an ideal curve reaches the point (FPR = 0, TPR = 1). Lower AUC values indicate that the model has difficulty distinguishing between specific classes.
IEMOCAP Dataset for Speech Emotion Recognition: Similarly, the ROC curve (Fig. 14) for the Proposed Model (TCNs) in speech emotion recognition reveals how well it can classify emotions based on audio cues. Higher AUC values indicate better emotion recognition, such as Happy, Surprised, and Neutral, and show that the model can effectively distinguish between speech tones. As with the AffectNet dataset, the ROC curve shows that the Proposed Model performs better when it achieves a high True Positive Rate while having a low False Positive Rate, indicating accurate emotion detection.
Fig. 14.
AUC ROC analysis of proposed model for both datasets.
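The per-class ROC curves and AUC values summarised in Fig. 14 follow the usual one-vs-rest treatment of a multi-class problem. A hedged sketch of that computation, assuming the classifier outputs per-class probabilities, is shown below.

```python
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

def per_class_roc(y_true, y_score, n_classes):
    """One-vs-rest ROC analysis for a multi-class emotion classifier.
    y_score holds predicted class probabilities, shape (n_samples, n_classes)."""
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        results[c] = {"auc": roc_auc_score(y_bin[:, c], y_score[:, c]),
                      "curve": (fpr, tpr)}    # points for plotting the ROC curve
    return results
```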
Multimodal impact analysis
This section examines the impact of various modalities (facial expressions, speech, and text) on the overall efficacy of the emotion recognition system. The findings indicate that multimodal fusion, especially through the use of GNNs, markedly improves accuracy and robustness by integrating complementary emotional signals from various sources, resulting in superior emotion recognition performance relative to single-modality models.
Table 7 and Fig. 15 compare the precision achieved by different modalities: Facial Modality, Speech Modality, Text Modality, and Multimodal Fusion using GNNs for various emotional states: Happy, Sad, Anger, Surprise, Fear, Disgust, and Neutral. Multimodal Fusion with GNNs consistently outperforms each modality across all emotions. For example, in the Happy emotion, Multimodal Fusion with GNNs achieves 96.34% Precision, which is significantly higher than Speech (92.67%), Facial (91.34%), and Text Modality (88.12%). Similar trends are seen in other emotions such as sadness, anger, fear, and disgust, where the combination of facial, speech, and text modalities results in the highest Precision values. This demonstrates how combining multiple modalities using GNNs improves the model’s ability to detect complex emotional cues. Neutral emotion also performs well in Multimodal Fusion with GNNs, with a precision of 96.89%, compared to 90.45% for Facial Modality, 94.13% for Speech Modality, and 89.22% for Text Modality.
Table 7.
Comparative analysis for multimodal impact analysis (%).
| Emotion | Facial modality precision | Speech modality precision | Text modality precision | Multimodal fusion precision (GNNs) |
|---|---|---|---|---|
| Happy | 91.34 | 92.67 | 88.12 | 96.34 |
| Sad | 89.45 | 91.22 | 85.33 | 94.56 |
| Anger | 84.22 | 90.14 | 85.21 | 92.32 |
| Surprise | 86.17 | 91.39 | 84.56 | 94.12 |
| Fear | 81.22 | 89.43 | 86.31 | 90.14 |
| Disgust | 82.76 | 88.55 | 87.14 | 91.67 |
| Neutral | 90.45 | 94.13 | 89.22 | 96.89 |
Fig. 15.
Comparison of precision for different modalities and multimodal fusion (GNNs).
Analysis of misclassifications
In this section, we examine the misclassifications of various emotions in the emotion recognition model. Table 8 shows the misclassified emotion pairs, their respective accuracy, and the frequency of such misclassifications. These pairs are identified using the confusion matrix and represent cases in which the model frequently confuses two specific emotions (Table 8 and Fig. 16).
For example, the model most frequently misclassifies Anger as Disgust (91.32% accuracy, 8% frequency), suggesting that these two emotions are often confused, possibly because of similar facial expressions or emotional cues.
Another common misclassification is Fear versus Sadness, with an accuracy of 95.22% and a frequency of 6%. This implies that both emotions may use similar tonalities or expressions in speech, resulting in frequent confusion.
The model also incorrectly classifies Happy as Neutral (93.92% accuracy, 5% frequency) and Surprise as Fear (92.93% accuracy, 4% frequency). These errors highlight the difficulties in distinguishing between subtle emotional states in real-world scenarios, where emotions such as Happy and Neutral or Surprise and Fear have similar characteristics.
Table 8.
Comparative analysis of misclassified emotion pairs (%).
| Misclassified emotion pairs | Accuracy (%) | Frequency (%) |
|---|---|---|
| Anger versus Disgust | 91.32 | 8 |
| Fear versus Sadness | 95.22 | 6 |
| Happy versus Neutral | 93.92 | 5 |
| Surprise versus Fear | 92.93 | 4 |
Fig. 16.
Misclassification analysis for emotion recognition.
Task completion impact
This section discusses the AI system’s impact on task completion rates and student engagement across different emotional states. Table 9 and Fig. 17 compare task completion rates before and after the AI system was implemented, as well as increases in engagement.
The proposed AI system resulted in significant improvements in task completion rates across all emotions. For example, the happy emotion increased from 75.43 to 95.19%, resulting in a 20% increase in engagement. This suggests that the AI system significantly increased student engagement and task completion among students experiencing positive emotions.
Similarly, task completion rates for emotions such as Sadness and Anger improved from 70.42 to 85.83% (+15%) and from 65.74 to 80.93% (+15%), respectively. These findings indicate that the AI system is equally effective in helping learners overcome obstacles associated with negative emotions.
The Neutral emotion showed a +13% increase in task completion, from 81.34 to 93.07%, demonstrating the system's ability to engage students who might otherwise be less involved.
Table 9.
Task completion rates and engagement increase (%).
| Emotion | Task completion rate (Before AI system) | Task completion rate (After AI system) | Engagement increase (%) |
|---|---|---|---|
| Happy | 75.43 | 95.19 | 20 |
| Sad | 70.42 | 85.83 | 15 |
| Anger | 65.74 | 80.93 | 15 |
| Surprise | 77.49 | 92.63 | 15 |
| Fear | 60.51 | 75.96 | 15 |
| Disgust | 68.95 | 83.85 | 15 |
| Neutral | 81.34 | 93.07 | 13 |
Fig. 17.
Task completion impact analysis.
Robustness and generalisation
Table 10 and Fig. 18 compare the performance of the proposed model (across all groups) to that of existing models (across all groups) for a variety of emotional states. The table summarises the overall performance (in terms of accuracy) of each emotion across all test groups. The proposed model outperforms the existing models in terms of accuracy across all emotions. For example, the Proposed Model has a Happy emotion accuracy of 95.23%, which is significantly higher than the Existing Models’ (85.67%). Sad and Anger show a significant difference between the proposed model (94.45% and 93.56%, respectively) and the existing models (80.91% and 82.45%). These findings suggest that the Proposed Model is more robust and applicable to a wide range of emotional states. For Fear and Disgust, the Proposed Model achieves 90.34% and 91.57% accuracy, respectively, compared to the Existing Models’ 79.76% and 80.84% accuracy.
Table 10.
Robustness and generalisation analysis for the proposed model across all groups (%).
| Emotion | Proposed model (across all groups) | Existing models (across all groups) |
|---|---|---|
| Happy | 95.23 | 85.67 |
| Sad | 94.45 | 80.91 |
| Anger | 93.56 | 82.45 |
| Surprise | 94.32 | 83.21 |
| Fear | 90.34 | 79.76 |
| Disgust | 91.57 | 80.84 |
Fig. 18.
Robustness and generalisation analysis for proposed model (across all groups).
Statistical significance (t-tests and ANOVA)
Statistical tests help determine whether observed differences in model performance are due to chance or are statistically significant. To compare the performance of the proposed model with the existing models, we use t-tests (for two models) and ANOVA (for multiple models).
Paired t-test for comparison of precision, recall, and F1 scores
- Objective: We aim to ascertain whether the differences between the Proposed Model and the Existing Models (CNN, ResNet, RNN, LSTM) in the performance metrics (precision, recall, F1 score, and accuracy) are statistically significant.
- Null Hypothesis (H0): The performance of the Proposed Model and the Existing Models does not differ significantly.
- Alternative Hypothesis (H1): The performance of the proposed model is noticeably superior to that of the current models.
- Paired t-test for precision comparison:
- Group 1: The proposed model’s precision values (ViTs, TCNs, BERT, and GNN).
- Group 2: LSTM, CNN, ResNet, and RNN precision values.
We conduct a paired t-test to evaluate the precision of the happy emotion as presented in Table 11.
Table 11.
Results for paired t-test to evaluate the precision of the happy emotion.
| Emotion | Proposed model precision | Existing model precision | t-statistic | p value |
|---|---|---|---|---|
| Happy | 96.34% | 89.54% | 5.24 | 0.002 |
Since the p value (0.002) is lower than 0.05, we reject the null hypothesis and conclude that the Proposed Model significantly outperforms the Existing Model in terms of precision.
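The paired comparison can be reproduced with SciPy; the per-run precision values below are illustrative placeholders, not the actual measurements behind Table 11.

```python
from scipy import stats

# Hypothetical per-run precision values for the Happy class (placeholders).
proposed = [96.1, 96.4, 96.3, 96.5, 96.4]
existing = [89.2, 89.6, 89.4, 89.8, 89.7]

t_stat, p_value = stats.ttest_rel(proposed, existing)   # paired t-test
if p_value < 0.05:
    print(f"Reject H0: t = {t_stat:.2f}, p = {p_value:.4f}")
```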
Analysis of variance (ANOVA) for comparison of multiple models
- Objective: To determine whether the performance of the various models (Proposed Model, CNN, ResNet, RNN, LSTM) differs significantly, an ANOVA was conducted.
- Null Hypothesis (H0): There are no appreciable variations in the performance of any model.
- Alternative Hypothesis (H1): There are significant performance differences between the models.
Table 12 and Fig. 19 show the F1 Scores of the Proposed Model (with ViTs, TCNs, BERT, and GNN), CNN, ResNet, RNN, and LSTM for Happy and Sad emotions. The results show that the Proposed Model outperforms the existing models in terms of F1 Score for both emotions by a significant margin, as indicated by the ANOVA F-statistics and p values. A p value below 0.05 signifies that the disparity in F1 scores is statistically significant among the models (Table 12). In this instance, regarding the Happy emotion, we reject the null hypothesis and determine that the Proposed Model significantly surpasses the others in F1 score.
Table 12.
ANOVA results for F1 scores.
| Emotion | Proposed model (ViTs, TCNs, BERT, GNN) | CNN | ResNet | RNN | LSTM | F-statistic | p value |
|---|---|---|---|---|---|---|---|
| Happy | 0.95 | 0.87 | 0.89 | 0.85 | 0.88 | 9.45 | 0.0002 |
| Sad | 0.93 | 0.83 | 0.84 | 0.81 | 0.82 | 8.13 | 0.0006 |
Fig. 19.
Comparison of F1 scores for different models on happy and sad emotions.
The outcomes of the t-tests and ANOVA (Table 13) indicate that the Proposed Model significantly surpasses the Existing Models across the performance metrics (precision, recall, F1 score, and accuracy) for all emotions. The p values for the Proposed Model (ViTs, TCNs, BERT, GNN) are substantially below the 0.05 threshold, signifying that these differences are statistically significant. The findings corroborate the hypothesis that sophisticated deep learning models, such as the Proposed Model, improve emotion recognition efficacy relative to conventional models.
Table 13.
Results from statistical testing for all emotions.
| Emotion | t-statistic (Precision) | p value (Precision) | F-statistic (F1) | p value (F1) |
|---|---|---|---|---|
| Happy | 5.24 | 0.002 | 9.45 | 0.0002 |
| Sad | 4.12 | 0.004 | 8.13 | 0.0006 |
| Anger | 3.45 | 0.008 | 7.91 | 0.0009 |
| Surprise | 4.59 | 0.003 | 8.65 | 0.0004 |
| Fear | 2.67 | 0.019 | 6.32 | 0.003 |
| Disgust | 3.12 | 0.011 | 7.47 | 0.0011 |
| Neutral | 5.15 | 0.001 | 9.78 | 0.0001 |
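The one-way ANOVA behind Tables 12 and 13 can likewise be run with SciPy; the per-fold F1 samples here are placeholders used only to show the call.

```python
from scipy import stats

# Hypothetical per-fold F1 scores for the Happy emotion (placeholders).
f1_scores = {
    "proposed": [0.95, 0.96, 0.95, 0.95, 0.96],
    "cnn":      [0.87, 0.86, 0.88, 0.87, 0.87],
    "resnet":   [0.89, 0.88, 0.90, 0.89, 0.89],
    "rnn":      [0.85, 0.84, 0.86, 0.85, 0.85],
    "lstm":     [0.88, 0.87, 0.89, 0.88, 0.88],
}

f_stat, p_value = stats.f_oneway(*f1_scores.values())   # one-way ANOVA across models
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")            # p < 0.05 -> reject H0
```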
k-fold cross-validation analysis
This section reports the results of k-fold cross-validation (k = 5) used to test the robustness and generalisability of the proposed emotion-aware AI model. Evaluating a model's performance reliably is one of the most important steps in machine learning; cross-validation, as a routine method, divides the dataset into several parts for repeated training and evaluation. The dataset was divided into five equal subsets, and the model was trained and tested five times, each time holding out a different fold for testing. This guarantees that every data point is used for both training and testing, yielding a more comprehensive evaluation. For this analysis, the proposed model, which combines Vision Transformers (ViTs), Temporal Convolutional Networks (TCNs), BERT-based models, and Graph Neural Networks (GNNs) for multimodal fusion, was cross-validated, and the per-fold and mean Precision, Recall, F1-score, and Accuracy were computed. The results obtained over the folds for each emotion are shown in Table 14.
Table 14.
K-fold cross-validation results for the proposed model (K = 5, values in %).
| Emotion | Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
|---|---|---|---|---|---|---|---|
| Happy | Precision | 96.12 | 96.25 | 96.40 | 96.34 | 96.58 | 96.34 |
| | Recall | 95.20 | 95.39 | 95.50 | 95.45 | 95.60 | 95.43 |
| | F1-score | 95.66 | 95.81 | 95.95 | 95.87 | 96.01 | 95.86 |
| | Accuracy | 95.10 | 95.35 | 95.52 | 95.47 | 95.68 | 95.42 |
| Sad | Precision | 94.23 | 94.35 | 94.52 | 94.58 | 94.62 | 94.42 |
| | Recall | 92.57 | 92.67 | 92.82 | 92.77 | 92.84 | 92.73 |
| | F1-score | 93.40 | 93.51 | 93.67 | 93.64 | 93.73 | 93.59 |
| | Accuracy | 93.20 | 93.30 | 93.50 | 93.45 | 93.60 | 93.41 |
| Anger | Precision | 91.25 | 91.30 | 91.50 | 91.34 | 91.40 | 91.36 |
| | Recall | 89.70 | 89.81 | 89.93 | 89.65 | 89.78 | 89.77 |
| | F1-score | 90.47 | 90.56 | 90.71 | 90.50 | 90.59 | 90.57 |
| | Accuracy | 90.34 | 90.45 | 90.58 | 90.78 | 90.65 | 90.56 |
| Surprise | Precision | 93.15 | 93.22 | 93.40 | 93.23 | 93.28 | 93.26 |
| | Recall | 92.38 | 92.46 | 92.60 | 92.47 | 92.52 | 92.48 |
| | F1-score | 92.76 | 92.84 | 92.99 | 92.85 | 92.90 | 92.87 |
| | Accuracy | 92.51 | 92.56 | 92.69 | 92.62 | 92.68 | 92.61 |
| Neutral | Precision | 96.12 | 96.20 | 96.38 | 96.56 | 96.50 | 96.35 |
| | Recall | 95.31 | 95.40 | 95.47 | 95.43 | 95.50 | 95.42 |
| | F1-score | 95.71 | 95.80 | 95.85 | 95.88 | 95.94 | 95.83 |
| | Accuracy | 95.47 | 95.55 | 95.67 | 95.69 | 95.74 | 95.62 |
The stability and effectiveness of the proposed emotion-aware AI model are evident from Table 14, which shows consistently reliable performance across folds. For instance, in detecting the Happy emotion, the model obtained an average precision of 96.34%, a recall of 95.43%, and an accuracy of 95.42% over the five folds, a strong indication of stable performance, with Fold 5 reaching the highest accuracy of 95.68%. Similarly, for the Neutral emotion the model achieved the highest overall accuracy (95.62%), precision (96.35%), and recall (95.42%), indicating that it also captures more ambiguous emotional states effectively. Importantly, the model outperforms each single-modality component (such as ViTs or TCNs alone) on the Sad and Anger emotions, with average accuracies of 93.41% and 90.56%, respectively, showing that it behaves as a balanced and reliable methodology across all emotion classes.
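The cross-validation protocol can be expressed compactly with scikit-learn. The sketch below assumes in-memory NumPy arrays and a hypothetical `build_model` factory returning a fit/predict estimator; the stratified folds are our addition rather than part of the stated protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def cross_validate(build_model, features, labels, k=5):
    """Train a fresh model on each of k folds and average precision,
    recall, F1 and accuracy over the folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, test_idx in skf.split(features, labels):
        model = build_model()                               # hypothetical factory
        model.fit(features[train_idx], labels[train_idx])
        preds = model.predict(features[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            labels[test_idx], preds, average="macro")
        fold_scores.append((p, r, f1, accuracy_score(labels[test_idx], preds)))
    return np.mean(fold_scores, axis=0)   # averaged (precision, recall, F1, accuracy)
```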
Ablation study on the proposed model
An ablation study was performed to analyse the contributions of individual components in the emotion-aware AI system. We systematically analyse how removing or replacing specific parts of the model, such as the Vision Transformers (ViTs), Temporal Convolutional Networks (TCNs), BERT-based models, and Graph Neural Networks (GNNs), affects performance. The contribution of each part is quantified by comparing the full model with versions lacking or simplifying a given component. The following configurations were considered in the ablation analysis.
Full model: ViTs, TCNs (TCN-32), BERT and GNNs used for multimodal fusion.
ViTs-only: Employs only ViTs for facial expression recognition and discards all the other modalities.
TCNs-only: uses only TCNs for speech emotion recognition and disregards visual and text inputs.
BERT-only: uses only textual input for sentiment analysis and disregards visual and speech inputs.
Fusion without GNNs: links the ViT, TCN, and BERT features by simple concatenation or dynamic weighting, without graph-based fusion (a minimal sketch of this baseline follows the list).
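As referenced in the last item, the following is a minimal sketch of the concatenation baseline; the embedding dimensions and classifier head are illustrative assumptions, not the exact configuration evaluated in Table 15.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """'Fusion without GNNs' baseline: concatenate per-modality embeddings and
    classify with a small fully connected head (dimensions are illustrative)."""
    def __init__(self, dim_face=768, dim_speech=256, dim_text=768, n_emotions=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_face + dim_speech + dim_text, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_emotions),
        )

    def forward(self, face_emb, speech_emb, text_emb):
        fused = torch.cat([face_emb, speech_emb, text_emb], dim=-1)
        return self.head(fused)   # unnormalised emotion logits
```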
The results of the ablation study (Table 15) unambiguously indicate that the full model (ViTs + TCNs + BERT + GNN) provides superior or at least comparable performance for all emotions. The ViTs-only model, which relies solely on visual input for facial expression recognition, performs well for emotions such as Happy but struggles with more complex emotions such as Sadness and Neutral, demonstrating that vision alone is not sufficient for accurate emotion detection. The same trend is observed for the TCNs-only and BERT-only models, as restricting recognition to speech or text leads to lower accuracy for emotions that require multimodal cues. The fusion model without GNNs, which uses simple fusion such as concatenation or attention, still achieves slight gains over the individual modalities but falls short of the GNN-based fusion. For example, Neutral emotion recognition shows a significant accuracy increase with the complete GNN-based model (96.11%) compared with the simpler fusion strategy (88.74%).
Table 15.
Ablation study results for emotion recognition (%).
| Emotion | Model | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Happy | Full Model (ViTs + TCNs + BERT + GNN) | 96.34 | 95.45 | 95.12 | 95.19 |
| | ViTs-only | 88.67 | 87.56 | 87.98 | 87.67 |
| | TCNs-only | 80.11 | 79.56 | 79.83 | 80.15 |
| | BERT-only | 81.23 | 80.98 | 81.02 | 81.11 |
| | Fusion without GNNs | 88.47 | 87.29 | 87.51 | 86.38 |
| Sadness | Full Model (ViTs + TCNs + BERT + GNN) | 94.58 | 92.77 | 93.24 | 94.45 |
| | ViTs-only | 84.89 | 82.55 | 83.67 | 83.45 |
| | TCNs-only | 77.92 | 76.11 | 76.99 | 77.30 |
| | BERT-only | 79.56 | 78.23 | 78.89 | 79.11 |
| | Fusion without GNNs | 83.92 | 82.17 | 82.56 | 82.88 |
| Anger | Full Model (ViTs + TCNs + BERT + GNN) | 91.34 | 89.65 | 90.12 | 90.78 |
| | ViTs-only | 80.23 | 78.45 | 79.22 | 79.30 |
| | TCNs-only | 85.12 | 83.87 | 84.49 | 84.61 |
| | BERT-only | 77.45 | 75.92 | 76.17 | 76.35 |
| | Fusion without GNNs | 84.52 | 83.19 | 83.85 | 83.71 |
| Surprise | Full Model (ViTs + TCNs + BERT + GNN) | 94.23 | 93.47 | 94.12 | 94.28 |
| | ViTs-only | 86.34 | 85.21 | 85.34 | 85.42 |
| | TCNs-only | 80.76 | 79.34 | 79.88 | 79.92 |
| | BERT-only | 79.12 | 78.45 | 78.88 | 78.96 |
| | Fusion without GNNs | 87.13 | 85.87 | 86.00 | 85.75 |
| Neutral | Full Model (ViTs + TCNs + BERT + GNN) | 96.56 | 95.43 | 95.67 | 96.11 |
| | ViTs-only | 89.12 | 87.67 | 88.14 | 88.45 |
| | TCNs-only | 81.34 | 79.56 | 80.45 | 80.90 |
| | BERT-only | 82.23 | 80.79 | 81.51 | 81.11 |
| | Fusion without GNNs | 89.19 | 87.67 | 88.30 | 88.74 |
These results show that all components are critical. ViTs are important for analysing visual cues of emotion but are not sufficient for understanding complex emotions that also depend on speech and text. Likewise, TCNs enhance speech emotion recognition, and BERT helps interpret textual cues; however, neither can fully capture the multi-dimensional nature of emotions on its own. The GNN-based fusion is essential for integrating the three modalities and accurately modelling their complex interdependencies. This ability allows the model to perform more accurate and more robust emotion recognition, especially for subtle emotional information, demonstrating the importance of multimodal fusion with GNNs for the final performance.
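To illustrate the idea of graph-based fusion discussed above, the sketch below treats the three modality embeddings as nodes of a small fully connected graph, applies one round of attention-weighted message passing, and pools the updated node features for classification. It is a toy illustration under assumed dimensions, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGraphFusion(nn.Module):
    """Toy graph fusion over three modality nodes (face, speech, text):
    one round of attention-weighted message passing, then mean pooling."""
    def __init__(self, dim=256, n_emotions=8):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # message transform
        self.att = nn.Linear(2 * dim, 1)      # pairwise attention score
        self.cls = nn.Linear(dim, n_emotions)

    def forward(self, nodes):                 # nodes: (batch, 3, dim)
        b, n, d = nodes.shape
        receiver = nodes.unsqueeze(2).expand(b, n, n, d)   # node i in pair (i, j)
        sender = nodes.unsqueeze(1).expand(b, n, n, d)     # node j in pair (i, j)
        att = F.softmax(
            self.att(torch.cat([receiver, sender], dim=-1)).squeeze(-1), dim=-1)
        # Attention-weighted sum of transformed neighbour messages, plus residual.
        messages = torch.einsum("bij,bjd->bid", att, self.msg(nodes))
        updated = F.relu(nodes + messages)
        return self.cls(updated.mean(dim=1))  # pooled graph embedding -> logits
```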
Ethical considerations (GDPR & FERPA compliant)
When using public datasets such as AffectNet and IEMOCAP that contain emotional information about their subjects, it is crucial to consider the ethical aspects of handling such sensitive data, in particular privacy and data collection (user consent and data protection). Although these datasets are publicly available (through repositories such as Kaggle and GitHub), the legal and ethical obligations surrounding their use still apply. This section describes the relevant GDPR and FERPA principles and how the proposed emotion recognition system complies with them, focusing on privacy and data protection.
GDPR compliance
The GDPR is an EU regulation that governs the collection, retention, and processing of personal data. Although the AffectNet and IEMOCAP datasets are public, we acknowledge that ethical standards must still be maintained when applying any personal or emotional data in this study, as required by the GDPR. The following steps were taken to ensure compliance:
Data anonymisation: AffectNet and IEMOCAP have been de-identified by the dataset creators and therefore contain no personally identifiable information (PII) that could directly identify an individual (e.g., names or direct identifiers). Nonetheless, emotion recognition data remains sensitive because it captures personal emotional expressions. To protect privacy, no personal information is linked with the emotional information used in this study.
Data minimisation: in line with the GDPR's data minimisation principle, only the features required for emotion recognition are used. We consider only facial expressions, speech tone, and text sentiment, without collecting any unnecessary private information that could infringe on users' privacy.
Data protection: all data (including the AffectNet and IEMOCAP datasets) is stored in encrypted form. Access is restricted to authorised personnel under strict access control, and all data is kept on secure local servers that protect against unauthorised access or disclosure.
Participant consent: although the data used in this study are public and de-identified, the original participants provided written informed consent for the use of their data, and we rely on the permission granted by the dataset providers for research purposes. If the system were used to gather additional user data (e.g., in live training contexts), explicit consent would be requested from participants in accordance with GDPR guidelines.
Rights of access, rectification, and erasure: if the system were to collect data directly from users, they would be able to access, correct, or delete their data at any stage of processing, consistent with the GDPR's right to erasure.
FERPA compliance
The Family Educational Rights and Privacy Act (FERPA) safeguards the privacy of students' educational records in the USA. Although FERPA primarily targets educational institutions and how they manage student data, its principles of privacy and control are directly applicable to AI in education. For the present research, in which students' emotional states during learning are inferred with emotion recognition, we address the following issues.
Educational setting: although the AffectNet and IEMOCAP datasets were not collected in educational settings, if the system were deployed in education (e.g., a classroom setting), all emotion data would be treated as part of the student record. Pursuant to FERPA, any information gathered from students would be used solely for educational purposes (e.g., tailoring instruction according to students' emotional states), and explicit consent would be obtained from students or their guardians before any data collection.
Access to data: FERPA prohibits access to educational records by unauthorised people, so any emotional information obtained would be tightly controlled. Only personnel with a legitimate educational interest would have access, ensuring that students' emotional reactions are used solely to enrich the learning experience and never for adversarial purposes.
Privacy: FERPA emphasises privacy; hence, student data, including the emotional information used in this study, is treated with strict confidentiality. Any identifiable emotional data would be secured and disclosed only to trusted users, such as educators or administrators who use the information to adapt the learning environment to the sensed emotions.
Transparency and ethical use of AI
Emotion recognition data is sensitive, so AI systems that rely on it must use it ethically and operate transparently. This includes:
Bias and fairness: both the AffectNet and IEMOCAP datasets exhibit demographic biases (e.g., AffectNet contains an over-representation of Western faces). To counterbalance this bias, we continue to test and optimise the emotion recognition model for fairness across a wide range of populations (races, genders, and cultures). In addition, we use data augmentation and fairness-aware learning approaches to improve the model's generalisation across a heterogeneous set of learners.
Explainability: The AI solution used for emotion recognition should be transparent and provide a clear explanation of why it has detected a given emotional state. This is particularly important for educational settings, where decisions made with emotional data can impact student learning greatly. To make the decision process understandable to students and teachers, methods like explainable AI (XAI) are integrated.
Data minimisation and ethical safeguards: the system is designed to handle only the emotional data required to adapt the learning environment; no non-essential personal data is collected. Regular audits and safeguards verify that data use remains ethical and within privacy law.
In summary, this study follows the GDPR and FERPA regulations on user consent and data protection in the management of emotional data. Our ethos is privacy by design: through anonymisation, secure storage, and data minimisation we hold ourselves to the highest standards of ethical AI in education. In addition, we ensure that emotional information is used fairly and transparently and that bias is mitigated. These are the building blocks of a system that protects privacy, builds trust with users, and generates useful data to improve educational outcomes.
Conclusion and future directions
Conclusion
In this work, we have proposed a new emotion-aware AI system that integrates multimodal emotional intelligence (EI) to enrich learning. The proposed system implements state-of-the-art deep learning methodologies: Vision Transformers (ViTs) for facial expression recognition, Temporal Convolutional Networks (TCNs) for speech emotion recognition, and BERT-based approaches for text sentiment analysis in unimodal processing, together with Graph Neural Networks (GNNs) to integrate information across the modalities. This integration allows the model to assess and react to students' affective-cognitive states instantly, making the learning environment more dynamic and adaptive to learners' individual differences. To assess the performance of our system, we carried out extensive experiments on two publicly available and well-known datasets: AffectNet for facial expression recognition and IEMOCAP for speech emotion recognition. These datasets, which cover facial, speech, and textual emotional cues, offer a strong basis for measuring the system's ability to interpret the complex multimodal emotional data expressed in an educational environment.
Our experiments showed that the emotion-aware AI system outperforms traditional cognition-only AI models, surpassing RNN, LSTM, CNN, and ResNet baselines on several evaluation metrics. It achieved 96.34% precision for happiness and 96.56% precision for neutral emotions, and it produced a 15–20% improvement in student engagement, a 20–25% reduction in frustration, and a 15–20% increase in successful task completion compared with traditional AI systems that model only cognitive factors. What distinguishes our proposed system is the multimodal fusion through GNNs, which learns the complex relations among facial expressions, vocal tone, and textual sentiment. This enables a deeper understanding of emotion, yielding a marked increase in the accuracy and context-sensitivity of emotion detection. With GNNs, the system becomes more adaptable to various emotional cues, providing personalised and empathetic responses on the fly.
These results suggest that emotion-aware AI systems can enhance student engagement and learning while also helping to create more supportive, responsive, and emotionally intelligent educational contexts. By combining emotional intelligence with deep learning models, our system can transform the conventional one-size-fits-all style of education into a more personalised and adaptive one. In future work, we will refine the system for real-time application in larger educational settings and investigate additional integrations with adaptive learning technologies (e.g., intelligent tutoring systems or VR-based platforms).
Future directions
Although the emotion-aware AI system described here is promising, several critical aspects need to be improved:
Better recognition of subtle emotions: future research can explore improvements in classifying subtle emotions such as fear, contempt, and disgust. Specialised models and more extensive data augmentation might enable the model to pick up these delicate emotional cues more reliably.
Enhanced multimodal fusion: future work will investigate multimodal fusion, particularly with GNNs, to understand how modality-specific emotional cues relate to one another. Dynamic weighting and prioritisation of modalities based on context (e.g., stress or excitement) would make the system more adaptable.
Real-time optimisation: to enable real-time deployment, further research should investigate optimisation strategies such as pruning, quantisation, and edge processing. Balancing accuracy and latency will remain a challenge when implementing the system in mobile and embedded setups.
Cross-Cultural and Cross-Domain Validation: To establish the model’s applicability to various cultural groups and age ranges, cross-cultural testing of the model is critical. We hope that future research will test how well the model works on a larger variety of datasets, to guarantee it behaves as expected in different educational environments.
Ethical issues: as emotion data tends to be sensitive, the ethical issues relating to privacy, consent, and data security must be addressed. Deployments must also include a transparent user-consent mechanism that supports the ethical use of emotional data.
Long-term engagement effects: future longitudinal research is needed to investigate the long-term impact of emotion-aware systems on student engagement, learning behaviours, and emotional well-being, and to confirm that the positive effects are maintained over time.
Interfacing with other educational technologies: future work should concentrate on interfacing the emotion-aware system with other adaptive technologies, e.g., intelligent tutoring, gamification, or virtual reality applications, to provide more immersive, personalised learning environments.
Interpretability and explainable AI: in future work, SHAP (Shapley Additive exPlanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) will be integrated into our method for model interpretability. SHAP can describe the relative contribution that each input feature (e.g., facial expression, speech tone, and text) makes to the model's decisions, making the system's predictions more transparent and understandable for educators. This will be especially relevant for understanding how the different emotional cues (facial, speech, or text) influence the system. Such methods provide not only accurate predictions but also clear explanations of why the AI system takes certain actions, thereby increasing trust and confidence in its use.
Acknowledgements
The author extends their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-17).
Abbreviations
- AI
Artificial intelligence
- FER
Facial expression recognition
- NLP
Natural language processing
- CNN
Convolutional neural network
- LSTM
Long short-term memory
- ViT
Vision transformer
- GNN
Graph neural network
- SMOTE
Synthetic minority over-sampling technique
- Dlib
Digital library (face processing toolkit)
- IoT
Internet of Things
- EEG
Electroencephalogram
- GDPR
General data protection regulation
- IEMOCAP
Interactive emotional dyadic motion capture
- SoftMax
Soft maximum function
- EI
Emotional intelligence
- SER
Speech emotion recognition
- DL
Deep learning
- RNN
Recurrent neural network
- TCN
Temporal convolutional network
- BERT
Bidirectional encoder representations from transformers
- MFCC
Mel-frequency cepstral coefficients
- MTCNN
Multi-task cascaded convolutional neural network
- FC
Fully connected
- FL
Federated learning
- HCI
Human–computer interaction
- FERPA
Family educational rights and privacy act
- AffectNet
Facial emotion dataset
- MF
Multimodal fusion
Author contributions
Umesh Kumar Lilhore, Xiaoyu Wu conceptualised the research, designed the methodology, and contributed to data analysis and result interpretation. Tientien Lee was responsible for data collection and played a key role in experimental work while assisting in manuscript drafting and revision. Umesh Kumar Lilhore focused on statistical analysis, data visualisation, and contributed to writing the discussion section. Sarita Simaiya supported laboratory work, experimental processes, and manuscript editing. Roobaea Alroobaea contributed to the literature review and assisted with manuscript revisions. Abdullah M. Baqasah provided technical support during data collection, validated results, and contributed to the methodology. Majed Alsafyani helped with data analysis and interpretation and provided feedback on the manuscript. Finally, Lidia Gosy Tekeste, as the corresponding author, oversaw the project, coordinated the team, and wrote the final manuscript, ensuring the research was completed.
Funding
This research was funded by Taif University, Taif, Saudi Arabia, project number (TU-DSPP-2024-17).
Data availability
The dataset is available from the corresponding author upon individual request.
Declarations
Competing interests
The authors declare no competing interests.
Consent for publication
All authors have reviewed and approved the final manuscript for publication.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Tientien Lee, Email: lee.tt@fsmt.upsi.edu.my.
Umesh Kumar Lilhore, Email: umeshlilhore@gmail.com.
Lidia Gosy Tekeste, Email: lidiagosytekeste@gmail.com.
References
- 1. Singh, T. M., Reddy, C. K. K., Murthy, B. V. R., Nag, A. & Doss, S. AI and education: Bridging the gap to personalized, efficient, and accessible learning. In Internet of Behavior-Based Computational Intelligence for Smart Education Systems, 131–160 (IGI Global, 2025).
- 2. Roumpas, K., Fotopoulos, A. & Xenos, M. A framework for ethical, cognitive-aware human–AI interaction in multimodal adaptive learning systems. In Cognitive-Aware Human–AI Interaction in Multimodal Adaptive Learning Systems.
- 3. Islam, M. M., Nooruddin, S., Karray, F. & Muhammad, G. Enhanced multimodal emotion recognition in healthcare analytics: A deep learning-based model-level fusion approach. Biomed. Signal Process. Control 94, 106241 (2024).
- 4. Qiang, S. U. N. Deep learning-based modeling methods in personalized education. Artif. Intell. Educ. Stud. 1(1), 23–47 (2025).
- 5. Hadinezhad, S., Garg, S. & Lindgren, R. Enhancing inclusivity: Exploring AI applications for diverse learners. In Trust and Inclusion in AI-Mediated Education: Where Human Learning Meets Learning Machines, 163–182 (Springer, Cham, 2024).
- 6. Kumar, R., Kumar, P., Sobin, C. C. & Subheesh, N. P. Blockchain and AI in Shaping the Modern Education System (2025).
- 7. Lee, A. V. Y., Koh, E. & Looi, C. K. AI in education and learning analytics in Singapore: An overview of key projects and initiatives. Inf. Technol. Educ. Learn. 3(1), Inv-p001 (2023).
- 8. Zhou, X. et al. Personalized federated learning with model-contrastive learning for multi-modal user modeling in human-centric metaverse. IEEE J. Sel. Areas Commun. 42(4), 817–831 (2024).
- 9. Soman, G., Judy, M. V. & Abou, A. M. Human guided empathetic AI agent for mental health support leveraging reinforcement learning-enhanced retrieval-augmented generation. Cogn. Syst. Res. 90, 101337 (2025).
- 10. Xia, B., Innab, N., Kandasamy, V., Ahmadian, A. & Ferrara, M. Intelligent cardiovascular disease diagnosis using deep learning enhanced neural network with ant colony optimization. Sci. Rep. 14(1), 21777 (2024).
- 11. Lateef, M. Harnessing AI and machine learning to elevate educational wearable technology. In Wearable Devices and Smart Technology for Educational Teaching Assistance, 53–80 (IGI Global Scientific Publishing, 2025).
- 12. Rayudu, K. M., Chinnammal, V., Rubiston, M. M., Padmaloshani, P., Singaravelu, R. & Merlin, N. R. G. Experimental analysis of artificial intelligence powered adaptive learning methodology using enhanced deep learning principle. In 2024 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), 1–7 (IEEE, 2024).
- 13. Salloum, S. A., Alomari, K. M., Alfaisal, A. M., Aljanada, R. A. & Basiouni, A. Emotion recognition for enhanced learning: Using AI to detect students’ emotions and adjust teaching methods. Smart Learn. Environ. 12(1), 21 (2025).
- 14. Vistorte, A. O. R. et al. Integrating artificial intelligence to assess emotions in learning environments: A systematic literature review. Front. Psychol. 15, 1387089 (2024).
- 15. Liu, Y. et al. Sample-cohesive pose-aware contrastive facial representation learning. Int. J. Comput. Vis. 133(6), 3727–3745 (2025).
- 16. Zhang, X., Cheng, X. & Liu, H. TPRO-NET: An EEG-based emotion recognition method reflecting subtle changes in emotion. Sci. Rep. 14(1), 13491 (2024).
- 17. Meng, T. et al. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 569, 127109 (2024).
- 18. Xie, Y., Yang, L., Zhang, M., Chen, S. & Li, J. A review of multimodal interaction in remote education: Technologies, applications, and challenges. Appl. Sci. 15(7), 3937 (2025).
- 19. Sangeetha, S. K. B., Immanuel, R. R., Mathivanan, S. K., Cho, J. & Easwaramoorthy, S. V. An empirical analysis of multimodal affective computing approaches for advancing emotional intelligence in artificial intelligence for healthcare. IEEE Access 12, 114416–114434 (2024).
- 20. Li, C., Weng, X., Li, Y. & Zhang, T. Multimodal learning engagement assessment system: An innovative approach to optimizing learning engagement. Int. J. Hum. Comput. Interact. 41(5), 3474–3490 (2025).
- 21. Khediri, N., Ben Ammar, M. & Kherallah, M. A real-time multimodal intelligent tutoring emotion recognition system (MITERS). Multimed. Tools Appl. 83(19), 57759–57783 (2024).
- 22. Sajja, R., Sermet, Y., Cikmaz, M., Cwiertny, D. & Demir, I. Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information 15(10), 596 (2024).
- 23. Chetry, K. K. Transforming education: How AI is revolutionizing the learning experience. Int. J. Res. Publ. Rev. 5(5), 6352–6356 (2024).
- 24. Zhang, X. et al. Smart classrooms: How sensors and AI are shaping educational paradigms. Sensors (Basel, Switzerland) 24(17), 5487 (2024).
- 25. Govea, J., Navarro, A. M., Sánchez-Viteri, S. & Villegas-Ch, W. Implementation of deep reinforcement learning models for emotion detection and personalization of learning in hybrid educational environments. Front. Artif. Intell. 7, 1458230 (2024).
- 26. Yadav, U. & Shrawankar, U. Artificial intelligence across industries: A comprehensive review with a focus on education. In AI Applications and Strategies in Teacher Education, 275–320 (2025).
- 27. Marques-Cobeta, N. Artificial intelligence in education: Unveiling opportunities and challenges. In Innovation and Technologies for the Digital Transformation of Education: European and Latin American Perspectives, 33–42 (2024).
- 28. Gan, W., Dao, M. S., Zettsu, K. & Sun, Y. IoT-based multimodal analysis for smart education: Current status, challenges, and opportunities. In Proceedings of the 3rd ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, 32–40 (2022).
- 29. Zhou, X., Xuesong, Xu., Liang, W., Zeng, Z. & Yan, Z. Deep-learning-enhanced multitarget detection for end–edge–cloud surveillance in smart IoT. IEEE Internet Things J. 8(16), 12588–12596 (2021).
- 30. Halkiopoulos, C. & Gkintoni, E. Leveraging AI in e-learning: Personalized learning and adaptive assessment through cognitive neuropsychology—A systematic analysis. Electronics 13(18), 3762 (2024).
- 31. Duan, S., Wang, Z., Wang, S., Chen, M. & Zhang, R. Emotion-aware interaction design in intelligent user interface using multi-modal deep learning. In 2024 5th International Symposium on Computer Engineering and Intelligent Communications (ISCEIC), 110–114 (IEEE, 2024).
- 32. Sharma, K., Papamitsiou, Z. & Giannakos, M. Building pipelines for educational data using AI and multimodal analytics: A “grey-box” approach. Br. J. Edu. Technol. 50(6), 3004–3031 (2019).
- 33. Villegas-Ch, W., Gutierrez, R. & Mera-Navarrete, A. Multimodal emotional detection system for virtual educational environments: Integration into Microsoft Teams to improve student engagement. IEEE Access 13, 42910–42933 (2025).
- 34. Li, Y., Chai, Z., You, S., Ye, G. & Liu, Q. Student portraits and their applications in personalized learning: Theoretical foundations and practical exploration. Front. Digit. Educ. 2(2), 1–17 (2025).
- 35. Javed, S., Ezehra, S. R., Ullah, H. & Naveed, M. How AI can detect emotional cues in students, improving virtual learning environments by providing personalized support and enhancing social-emotional learning. Rev. Appl. Manag. Soc. Sci. 8(2), 665–682 (2025).
- 36. Zong, Y. & Yang, L. How AI-enhanced social–emotional learning framework transforms EFL students’ engagement and emotional well-being. Eur. J. Educ. 60(1), e12925 (2025).
- 37. Thirunagalingam, A. & Whig, P. Emotional AI integrating human feelings in machine learning. In Humanizing Technology With Emotional Intelligence, 19–32 (IGI Global Scientific Publishing, 2025).
- 38. Annapareddy, V. N., Singireddy, J., Nanan, B. P. & Burugulla, J. K. R. Emotional Intelligence in Artificial Agents: Leveraging Deep Multimodal Big Data for Contextual Social Interaction and Adaptive Behavioral Modelling (2025).
- 39. Zhang, F., Wang, X. & Zhang, X. Applications of deep learning method of artificial intelligence in education. Educ. Inf. Technol. 30(2), 1563–1587 (2025).
- 40. Kolhatin, A. O. From automation to augmentation: A human-centered framework for generative AI in adaptive educational content creation. In CEUR Workshop Proceedings, 143–195 (2025).
- 41. Sajja, R., Sermet, Y., Cwiertny, D. & Demir, I. Integrating AI and learning analytics for data-driven pedagogical decisions and personalized interventions in education (2023). https://arxiv.org/abs/2312.09548.
- 42. Parkavi, R., Karthikeyan, P. & Abdullah, A. S. Enhancing personalized learning with explainable AI: A chaotic particle swarm optimization-based decision support system. Appl. Soft Comput. 156, 111451 (2024).
- 43. Cheng, S., Liu, Q., Chen, E., Huang, Z., Huang, Z., Chen, Y., Ma, H. & Hu, G. DIRT: Deep learning enhanced item response theory for cognitive diagnosis. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2397–2400 (2019).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The dataset is available from the corresponding author upon individual request.