Abstract
Artificial intelligence–driven educational systems have largely prioritised cognitive adaptation, often neglecting the critical role of learners’ emotional states in shaping engagement and learning outcomes. To address this limitation, this study proposes a multimodal, emotion-aware deep learning framework designed to integrate emotional intelligence into intelligent learning environments. The framework jointly analyses facial expressions, speech characteristics, and textual responses to infer learners’ emotional states and models the interdependencies among these modalities through a graph-based fusion mechanism. The proposed approach is evaluated using benchmark emotion datasets, namely AffectNet and IEMOCAP, to assess its capability to recognise emotional patterns and support adaptive feedback during learning interactions. Experimental results demonstrate that incorporating emotional awareness leads to substantial improvements in learner engagement, emotional regulation, and task persistence when compared with conventional cognition-focused systems. The framework achieves consistently high emotion recognition performance, particularly for positive and neutral affective states, and shows robust generalisation across different emotion categories. User study outcomes further suggest that learners perceive the system as more supportive and responsive due to its emotional adaptability. In addition to performance evaluation, the study discusses key ethical considerations associated with emotion-aware educational technologies, including data privacy, informed consent, and responsible deployment. Overall, the findings underscore the potential of multimodal emotional intelligence to advance the development of more empathetic, adaptive, and effective artificial intelligence-based educational systems.
Keywords: Deep learning, Emotional intelligence, Personalised learning, Multimodal data, AI education, Facial expression recognition, Speech sentiment analysis
Subject terms: Mathematics and computing, Psychology
Introduction
Artificial intelligence has rapidly transformed the educational landscape by enabling adaptive learning, personalised instruction, and automated assessment mechanisms. Most AI-driven educational systems, however, primarily emphasise cognitive performance while overlooking the emotional factors that critically influence learner engagement, motivation, and persistence1,2. Educational research has long established that emotions such as curiosity, frustration, anxiety, and satisfaction play a decisive role in shaping attention, memory, and problem-solving abilities during learning activities. Consequently, the absence of emotional intelligence in intelligent tutoring systems limits their effectiveness and responsiveness3.
Emotional intelligence in education refers to the ability to recognise, interpret, and respond appropriately to learners’ emotional states. Positive emotions tend to facilitate deeper engagement and knowledge retention, whereas negative emotions can obstruct learning progress4. Integrating emotional intelligence into AI-driven educational platforms is therefore essential for developing adaptive systems that respond not only to learners’ cognitive needs but also to their affective conditions, enabling more empathetic and supportive learning environments5.
Current methods and challenges
Recent advances in affective computing have enabled AI systems to recognise emotions through multiple modalities, including facial expressions, speech signals, and textual interactions. Facial expression recognition techniques analyse visual cues to infer emotional states, speech-based approaches capture paralinguistic features such as tone and pitch, and text sentiment analysis extracts affective information from written responses. More recently, multimodal emotion recognition systems have emerged, combining information from multiple sources to obtain a holistic representation of learners’ emotional states6.
Despite these advances, several limitations persist. Many existing systems rely on single-modality analysis or employ simplistic fusion strategies that fail to capture complex interdependencies among emotional cues. Furthermore, emotion recognition models often struggle with generalisation across diverse learners and dynamic educational contexts. These limitations highlight the need for integrated frameworks capable of effectively modelling multimodal emotional information while supporting real-time adaptability in learning environments7.
Motivation of the work
The motivation for this work arises from the growing need for AI-driven educational systems that are not only cognitively adaptive but also emotionally aware. While current intelligent learning platforms can personalise content based on performance metrics, they often neglect the emotional variability that significantly affects learning outcomes. Addressing learners’ emotional states can help reduce frustration, sustain engagement, and promote task persistence8.
This study is driven by the premise that incorporating emotional intelligence into AI-based education can lead to more personalised, empathetic, and effective learning experiences. By integrating emotional cues from multiple modalities and modelling their interactions, the proposed framework aims to support adaptive feedback mechanisms that align instructional strategies with learners’ affective states9,10.
Key contributions
The main contributions of this work are summarised as follows:
A multimodal emotion-aware AI framework that integrates facial, speech, and textual emotional cues to support adaptive educational interactions11.
An effective fusion strategy that models relationships among multimodal emotional signals to enhance robustness and interpretability12.
A comprehensive experimental evaluation demonstrating improved engagement, emotional regulation, and task completion compared to conventional cognitive-focused systems.
An analysis of ethical considerations related to emotion-aware educational technologies, including privacy protection and responsible deployment.
Organisation of the article
The remainder of this article is organised as follows. The "Literature review" section reviews related work on emotion-aware AI and educational technologies. The "Materials and methods" section describes the proposed framework, datasets, and methodological details. The "Experimental results and discussion" section presents experimental results and performance evaluation. Finally, the "Conclusion and future directions" section concludes the paper and outlines future research directions.
Literature review
Recent advances in artificial intelligence have significantly influenced educational technologies, particularly in the areas of personalised learning, adaptive feedback, and learner engagement. Contemporary research increasingly recognises that effective learning systems must account not only for cognitive performance but also for learners’ emotional states, as emotions directly influence attention, motivation, and persistence. As a result, emotion-aware AI systems have emerged as a key research direction in intelligent education.
Several recent studies have explored the integration of emotional intelligence into AI-driven educational platforms. Singh et al.1 highlighted how emotion-aware AI can bridge the gap between cognitive learning and personalised instruction by enabling systems to adapt dynamically to learners’ affective conditions. Similarly, Roumpas et al.2 proposed an ethical, cognitive-aware framework for multimodal adaptive learning systems, emphasising the importance of aligning emotional awareness with responsible AI practices. These works underline the growing consensus that emotional intelligence is a foundational component of next-generation educational AI.
Multimodal emotion recognition has gained particular attention due to its ability to capture complementary emotional cues from different sources. Islam et al.3 proposed a deep learning–based multimodal fusion approach that integrates facial expressions and textual sentiment, demonstrating improved emotion recognition performance. Salloum et al.13 further showed that real-time emotion recognition can be used to adapt teaching strategies dynamically, leading to enhanced learner engagement. Recent systematic reviews confirm that multimodal affective computing outperforms unimodal approaches in educational contexts by providing a more holistic understanding of learner emotions14.
Advances in representation learning have further strengthened emotion recognition capabilities. Liu et al.15 introduced a pose-aware contrastive facial representation learning framework that improves robustness under diverse visual conditions, which is particularly relevant for unconstrained educational environments. Complementing visual approaches, Zhang et al.16 proposed an EEG-based emotion recognition model capable of detecting subtle emotional variations, highlighting the potential of physiological signals, although practical classroom deployment remains limited. These studies demonstrate the increasing sophistication of emotion recognition techniques used in recent AI systems.
Graph-based and relational modelling approaches have recently emerged as powerful tools for multimodal emotion fusion. Meng et al.17 proposed a heterogeneous graph-based multi-message passing framework for conversational emotion recognition, showing that graph neural networks can effectively model complex relationships among emotional cues. Such approaches are particularly relevant for educational systems that must integrate facial, speech, and textual information while preserving contextual dependencies. Related studies on multimodal analytics pipelines further support the use of structured fusion strategies to enhance interpretability and robustness in learning environments18,19.
In parallel, AI-driven personalised learning frameworks have evolved to integrate emotional intelligence alongside cognitive modelling. Recent studies demonstrate that emotionally adaptive learning systems can improve engagement, reduce frustration, and support sustained task completion10,12,20. Privacy-preserving techniques such as federated learning have also been explored to address ethical concerns associated with emotional data usage, enabling personalised adaptation while safeguarding sensitive learner information8. These developments reflect the increasing importance of ethical and trustworthy AI in education.
Despite these advances, existing research reveals several limitations. Many systems still rely on limited fusion mechanisms or evaluate emotion recognition independently of educational outcomes. Moreover, few studies provide unified frameworks that combine multimodal emotion recognition, relational modelling, and real-world educational evaluation. Addressing these gaps, the present work builds upon recent advances in multimodal affective computing, graph-based fusion, and emotion-aware learning to propose an integrated framework that supports adaptive, empathetic, and ethically responsible AI-driven education. Table 1 presents a comparative analysis of existing research.
Table 1.
Summary of recent studies highlighting advances and limitations in emotion-aware AI-based educational systems.
| References | Focus | Technology used | Proposed model features | Outcome/impact | AI model type | Educational context |
|---|---|---|---|---|---|---|
| Singh et al.1 | Integration of cognitive and emotional intelligence in education | Transformer-based models, emotion recognition | Multimodal emotional intelligence for adaptive and personalised learning | Improved learner engagement and cognitive–emotional learning outcomes | Transformers, GNNs | Personalised and emotion-aware learning |
| Salloum et al.13 | Emotion recognition for adaptive teaching | Emotion recognition, adaptive AI feedback | Real-time emotional and cognitive state adaptation | Enhanced student satisfaction and engagement | Emotion recognition, Adaptive AI | Emotion-aware teaching systems |
| Soman et al.9 | Empathetic AI agents in learning and mental health | Reinforcement learning, empathy modelling | Emotion-responsive AI agents for learner support | Reduced anxiety and improved learner engagement | Reinforcement learning, Empathetic AI | Mental health support in education |
| Lateef11 | Wearable AI for emotional engagement | Wearable devices, machine learning | Continuous emotional monitoring and feedback | Improved emotional engagement and learning effectiveness | Wearable AI, ML | Emotion-aware educational wearables |
| Islam et al.3 | Multimodal emotion recognition | Deep learning, facial and text sentiment analysis | Multimodal fusion for real-time emotion detection | Enhanced emotion recognition accuracy with educational applicability | Deep learning, Multimodal | Healthcare analytics with educational relevance |
| Zhou et al.8 | Privacy-preserving personalised learning | Federated learning, multimodal data | Secure emotional and cognitive data fusion | Improved personalisation with enhanced privacy | Federated learning, Multimodal | Privacy-aware personalised education |
| Khediri et al.21 | Real-time multimodal intelligent tutoring | Multimodal emotion recognition, real-time feedback | Emotion-aware intelligent tutoring system | Increased engagement and learner satisfaction | Multimodal AI, Real-time systems | Adaptive intelligent tutoring |
| Vistorte et al.14 | Systematic review of emotion-aware AI in education | AI-based emotion recognition | Emotion-driven adaptive learning strategies | Higher engagement, reduced frustration, improved completion rates | Emotion-aware AI | Emotion-adaptive learning environments |
| Sajja et al.22 | AI-enabled intelligent learning assistants | Deep learning, adaptive AI | Cognitive and emotional adaptation in learning assistants | Improved learning effectiveness through emotional awareness | Deep learning, AI Assistants | Personalised adaptive learning |
| Chetry23 | Emotion detection in learning environments | Emotion detection algorithms | Real-time emotion-aware feedback | Increased learner engagement | Emotion detection, AI | Adaptive learning systems |
Materials and methods
The key materials and methods related to the research are as follows.
Dataset description
This section gives a detailed overview of the datasets used for emotion recognition in AI-powered educational systems. It specifically includes the AffectNet dataset for facial expression recognition and the IEMOCAP dataset for speech emotion recognition, which are used to train the deep learning models in this study21,24,25.
AffectNet dataset
AffectNet is a large-scale facial expression dataset created specifically for facial expression recognition. It contains around 0.4 million facial images annotated with discrete expression labels and continuous valence–arousal values. The dataset covers 8 emotion classes: neutral, happy, angry, sad, fear, surprise, disgust, and contempt. In addition, each image carries continuous annotations of valence (pleasant vs. unpleasant) and arousal (activity level, from low to high). These annotations support both categorical (emotion labels) and dimensional (valence, arousal) analysis of emotions, making the dataset appropriate for many emotion recognition applications in intelligent systems24.
Nevertheless, it should be noted that, although AffectNet is a valuable resource for facial expression recognition (FER), it was not collected from educational environments. The emotional expressions in the dataset may therefore not cover the full range of emotions that students experience during educational tasks, and the facial expressions it contains may differ from the subtle, context-dependent emotions that students show in a real classroom. In addition, the cultural bias of the sample, which is drawn mainly from Western populations, can limit generalisability to diverse student groups. To address these shortcomings, collecting domain-specific emotion data from real classrooms is a natural future extension. Such data would capture multimodal student emotional expressions during interaction with AI-based educational tools, supporting better modelling of the emotions elicited in academic settings.
IEMOCAP dataset
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is a well-known multimodal corpus for emotion recognition. It consists of 302 recorded videos of spoken dialogues collected over 5 recording sessions between speaker pairs (10 speakers; 5 male and 5 female). The conversations were designed to elicit emotions, with participants instructed to express different emotional states. The emotion categories in the corpus are: Angry, Excited, Fear, Sad, Surprised, Frustrated, Happy, Disappointed, and Neutral. In addition to the categorical labels, the dataset provides valence, arousal, and dominance ratings, supporting both categorical and dimensional analysis.
Because IEMOCAP captures multimodal expressions of emotion through speech, facial expressions (video), and text transcriptions, it is well suited for training an emotion-aware AI system, particularly for multimodal emotion recognition tasks in which emotions must be inferred from speech, facial expression, and written language. However, the IEMOCAP dataset was not recorded in an educational context and may not cover the full range of emotions that students express in a learning setting. Its dialogues are largely pre-scripted, so the expression of emotion is not necessarily as spontaneous or situation-driven as it would be in a classroom. Moreover, the adult speaker population of IEMOCAP limits its applicability to student populations, who may express emotions differently from adults. To make the framework more generalisable to educational scenarios, we intend to build a new domain-specific dataset that captures student–teacher interactions in real classrooms, enabling more precise and contextually grounded recognition of emotions in student–teacher and peer-to-peer exchanges in school environments. Table 2 presents the dataset details for AffectNet and IEMOCAP.
Table 2.
Dataset summary for AffectNet and IEMOCAP.
| Attribute | AffectNet dataset | IEMOCAP dataset |
|---|---|---|
| Dataset size | ~ 0.4 million facial images | 302 videos (151 sessions, 2 speakers per session) |
| Number of classes (emotions) | 8 classes: Neutral, Happy, Angry, Sad, Fear, Surprise, Disgust, Contempt | 9 classes: Angry, Excited, Fear, Sad, Surprised, Frustrated, Happy, Disappointed, Neutral |
| Data types (modalities) | Facial expressions only | Speech (audio), Video (facial expressions), Text (transcriptions) |
| Valence and arousal labels | Yes, continuous valence and arousal ratings | Yes, continuous valence, arousal, and dominance ratings |
| Emotion labelling type | Categorical (8 emotions), Dimensional (Valence, Arousal) | Categorical (9 emotions), Dimensional (Valence, Arousal, Dominance) |
| Speaker demographics | N/A | 10 speakers (5 male, 5 female) |
| Recording sessions | N/A | 5 sessions |
| Strengths | Large-scale dataset for facial emotion recognition | Multimodal dataset for emotion recognition across speech, video, and text |
| Limitations | Posed facial expressions, cultural bias | Adult speakers, scripted emotional dialogues |
Data pre-processing
Data preprocessing is a pivotal part of building machine learning models, particularly for emotion recognition tasks, where data quality and uniformity directly affect performance. This section describes the preprocessing applied to AffectNet and IEMOCAP. For IEMOCAP, preprocessing is divided into audio, video, and text handling; for AffectNet, the focus is on facial images. Preprocessing posed challenges such as mitigating the class imbalance that is common in emotion recognition datasets, and maintaining data integrity while standardising format and scale. The adopted approaches aimed to retain as much information as possible while cleaning, but not oversimplifying, the datasets for model training14,23.
Preprocessing of AffectNet dataset
The AffectNet dataset consists of facial images annotated with discrete emotional labels. To ensure data consistency and suitability for deep learning–based facial emotion recognition, a structured preprocessing pipeline was designed. This pipeline aimed to standardise facial representations, enhance data quality, and improve model generalisation. The main preprocessing steps are summarised as follows11,12,26.
- Facial detection, landmark extraction, and alignment
Face detection was performed using state-of-the-art detectors, including MTCNN and Dlib, to accurately localise facial regions within each image. Following detection, facial landmark extraction was applied to identify key reference points such as the eyes, nose, and mouth. These landmarks were used to align facial images to a canonical orientation, thereby reducing pose variability and ensuring spatial consistency across samples. This alignment process enables the model to focus on discriminative facial features relevant to emotion recognition rather than extraneous background or pose-related variations20,27.
- Image normalisation
Image normalisation was conducted using min–max scaling, where pixel intensity values were rescaled to the range [0, 1] by dividing each pixel value by the maximum possible intensity value (255). This approach standardises the input distribution, improves numerical stability, and facilitates faster convergence during model training. Min–max normalisation was selected over z-score normalisation because it preserves relative pixel intensity relationships, which is particularly suitable for convolutional and transformer-based vision models and ensures compatibility with pre-trained model initialisation. Furthermore, this method mitigates the effects of illumination and colour variations across images23,28.
- Data augmentation
To enhance robustness and reduce overfitting, data augmentation techniques were applied to the training images. These transformations included random rotation, horizontal flipping, scaling, and zooming. By simulating real-world variations such as head pose changes and facial movement, data augmentation improves the model’s ability to generalise to unseen facial expressions under diverse conditions3,13.
- Handling class imbalance
The AffectNet dataset exhibits notable class imbalance, with emotions such as happy and neutral being overrepresented, while others, including contempt and disgust, are underrepresented29,30. To address this issue and ensure balanced learning, multiple strategies were employed:
- Over-sampling: Minority classes were augmented using synthetic data generation techniques such as the synthetic minority over-sampling technique (SMOTE).
- Under-sampling: Samples from majority classes were selectively reduced to minimise bias towards dominant emotion categories.
- Class weighting: During model training, higher class weights were assigned to underrepresented emotion classes to encourage the model to learn discriminative features across all categories more effectively25.
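To make these steps concrete, the following minimal sketch illustrates face detection with MTCNN, min–max normalisation, Keras-style augmentation, and class weighting. It is an illustrative outline rather than the exact pipeline used in the study: landmark-based alignment is omitted for brevity, the `mtcnn`, `opencv-python`, `tensorflow`, and `scikit-learn` packages are assumed, and the label array is a placeholder.

```python
# Illustrative sketch of the AffectNet-style facial preprocessing (not the exact study pipeline).
import cv2
import numpy as np
from mtcnn import MTCNN
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.preprocessing.image import ImageDataGenerator

detector = MTCNN()

def preprocess_face(image_bgr, size=224):
    """Detect the most confident face, crop, resize, and min-max normalise to [0, 1]."""
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    faces = detector.detect_faces(image_rgb)
    if not faces:
        return None
    x, y, w, h = max(faces, key=lambda f: f["confidence"])["box"]
    crop = image_rgb[max(y, 0):y + h, max(x, 0):x + w]
    crop = cv2.resize(crop, (size, size))
    return crop.astype("float32") / 255.0          # min-max scaling

# Augmentation roughly matching the transformations described above.
augmenter = ImageDataGenerator(rotation_range=15, horizontal_flip=True,
                               zoom_range=0.1, width_shift_range=0.05,
                               height_shift_range=0.05)

# Class weighting for imbalanced emotion labels (the label array is a placeholder).
labels = np.array([0, 0, 0, 1, 2, 2, 3])
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
class_weights = dict(zip(np.unique(labels), weights))
```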
Preprocessing of IEMOCAP dataset
The IEMOCAP dataset comprises multimodal data, including speech audio, video-based facial expressions, and textual transcriptions, annotated with categorical and dimensional emotion labels. To ensure consistency and suitability for deep learning-based multimodal emotion recognition, a structured pre-processing pipeline was designed. The pipeline aimed to standardise modality-specific inputs, reduce noise, and preserve temporal and emotional information31. The main preprocessing steps are outlined below.
- Audio pre-processing
Speech signals were resampled to a uniform sampling rate and subjected to noise reduction to minimise background interference. Pre-emphasis filtering was applied to enhance high-frequency components, followed by framing and windowing to segment the signal into short-time frames. Mel-Frequency Cepstral Coefficients (MFCCs) and their first- and second-order derivatives were extracted to capture spectral and temporal characteristics relevant to emotional expression32,33. Feature normalisation was then applied to stabilise training and improve convergence of the speech emotion recognition model.
- Video and facial pre-processing
From the video recordings, facial frames were extracted at a fixed frame rate. Face detection and landmark extraction were performed using MTCNN and Dlib to localise and align facial regions. The detected faces were cropped, aligned to a canonical orientation, and normalised to ensure spatial consistency across frames. This process reduces variations caused by head pose, scale, and illumination, allowing the model to focus on discriminative facial emotion features18,31.
- Text pre-processing
Textual transcriptions associated with each utterance were cleaned to remove punctuation, non-linguistic symbols, and transcription artefacts20. The cleaned text was tokenised and encoded using a BERT-based tokeniser, enabling contextual embedding generation that captures both semantic meaning and emotional nuance. Padding and truncation were applied to maintain uniform sequence lengths for efficient batch processing34,35.
- Temporal alignment and synchronisation
To support effective multimodal fusion, audio, video, and text modalities were temporally aligned at the utterance level. This alignment ensures that emotional cues extracted from different modalities correspond to the same temporal context, enabling coherent multimodal learning and graph-based fusion36.
- Handling class imbalance
The IEMOCAP dataset exhibits imbalance across emotion categories. To mitigate this issue, class weighting was applied during training to emphasise underrepresented emotions. Additionally, data augmentation techniques such as time stretching and pitch shifting were applied to speech samples belonging to minority classes, improving class balance while preserving emotional characteristics37,38.
Through these preprocessing steps, the IEMOCAP dataset was transformed into a clean, temporally aligned, and balanced multimodal dataset suitable for robust speech-, facial-, and text-based emotion recognition in emotion-aware educational AI systems.
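As an illustration of the audio branch described above, the following minimal sketch extracts MFCCs with first- and second-order deltas, applies pre-emphasis and per-utterance normalisation, and shows the time-stretching and pitch-shifting augmentations used for minority classes. It assumes the `librosa` package; file paths and parameter values are placeholders rather than the study's exact settings.

```python
# Minimal sketch of the IEMOCAP audio preprocessing described above (illustrative only).
import librosa
import numpy as np

def extract_mfcc_features(wav_path, sr=16000, n_mfcc=13):
    """Resample, pre-emphasise, and extract MFCCs with delta and delta-delta features."""
    y, sr = librosa.load(wav_path, sr=sr)            # resample to a uniform rate
    y = librosa.effects.preemphasis(y, coef=0.97)    # boost high-frequency components
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.vstack([mfcc, delta, delta2])
    # Per-utterance feature normalisation to stabilise training.
    return (features - features.mean(axis=1, keepdims=True)) / \
           (features.std(axis=1, keepdims=True) + 1e-8)

def augment_minority_sample(y, sr=16000):
    """Time stretching and pitch shifting used to balance under-represented emotions."""
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return stretched, shifted
```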
Proposed model architecture
The proposed model is a layered, multimodal deep learning framework designed to embed emotional intelligence into AI-driven educational systems by jointly analysing facial expressions, speech signals, and textual interactions. As illustrated in Fig. 1, the framework begins with multimodal data acquisition, where facial images (AffectNet) and speech–text data (IEMOCAP) are processed through modality-specific preprocessing pipelines. Facial inputs undergo face detection, landmark alignment, min–max normalisation, and augmentation, while speech signals are denoised and transformed into temporal acoustic representations. Textual inputs are cleaned, tokenised, and encoded using a BERT-based contextual embedding strategy. Each modality is then processed by a dedicated deep learning backbone—Vision Transformers for facial emotion representation, Temporal Convolutional Networks (TCNs) for speech emotion modelling, and BERT-based encoders for text sentiment extraction, ensuring effective capture of modality-specific emotional patterns1,2,5.
Fig. 1.
The proposed model architecture.
To enable holistic emotion inference, the extracted features are integrated through a Graph Neural Network (GNN) that explicitly models cross-modal dependencies and relational interactions among emotional cues. This graph-based fusion mechanism facilitates robust multimodal representation learning by leveraging message passing across modalities, thereby enhancing interpretability and resilience to noisy or incomplete inputs. The fused representation is subsequently passed through fully connected layers, followed by a SoftMax classifier to predict learners’ emotional states, which are then used to drive adaptive, emotion-aware learning feedback. The complete functional workflow of the proposed framework, including data splitting, preprocessing, feature extraction, fusion, and classification, is formally detailed in Algorithm 1. Together, the architecture and algorithm establish a unified, scalable, and ethically aligned solution for emotion-aware educational AI systems.
Algorithm 1.
The proposed model architecture.
Vision transformers for facial expression recognition
While emotion-aware AI educational systems leverage advanced technologies to personalise learning, they require nuanced emotion recognition. Vision Transformers discern subtle facial cues to gauge changing affective states39,40. A comprehensive understanding, however, requires fusing several modalities: speech tonality, written language, and fleeting micro-expressions jointly reveal the student's inner experience and reactions over the course of a learning session. By perceiving the multidimensional nature of emotion through multisource input, the system can sensitively adapt instruction to enhance outcomes18,30. Developing such systems also carries responsibility: learners' well-being, growth, and privacy must be safeguarded alongside technological progress.
Mathematical framework for ViTs in facial expression recognition
The proposed system employs Vision Transformers for the analysis of facial images and the detection of emotional states19,26,30. This section outlines the process in detail:
- Input image: The input to the ViT model consists of a facial image with dimensions H × W × C, where H represents height, W denotes width, and C indicates the number of channels (e.g., RGB). A facial image is captured via a webcam or camera during student interactions with the AI system.
- Patch embedding: In Vision Transformers, the image is divided into non-overlapping patches. The image I, with dimensions H × W × C, is segmented into patches of size P × P. For instance, when the image measures 224 × 224 pixels, it is segmented into patches of 16 × 16 pixels. Each patch $x_i$ is flattened into a one-dimensional vector and subsequently projected into a higher-dimensional space to generate the patch embeddings, as presented by Eq. (1):

$$e_i = E\,x_i, \qquad i = 1, \ldots, N \tag{1}$$

where $N = HW/P^{2}$ is the number of patches, E is the learned projection matrix, and each patch $x_i$ is embedded into a vector of dimension D.
- Positional encoding: Positional encodings are added to each patch embedding to preserve spatial information, because transformers do not naturally account for the spatial positions of patches. The patch embeddings $e_i$ are supplemented with the positional encoding $PE_i$ as presented by Eq. (2):

$$z_i = e_i + PE_i \tag{2}$$

Thanks to positional encoding, the model is better able to comprehend the relative locations of facial features (eyes, nose, mouth, etc.).
- Transformer encoder layers: After positional information is added, the patch embeddings are sent through several transformer encoder layers. Each transformer layer is composed of multi-head self-attention and a feedforward network40,41. The attention scores A, which indicate how much each patch should "attend" to other patches in the image, are calculated by the self-attention mechanism as presented by Eq. (3):

$$A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \tag{3}$$

where the query, key, and value matrices are denoted by Q, K, and V, respectively, and the key vector's dimension is represented by $d_k$; the attention output is then obtained by applying A to the value matrix V. This mechanism enables the model to learn relationships between different facial regions (for example, the mouth and eyes) while also capturing global features like smiles, frowns, and raised brows.
- Feature extraction and output: A feature vector $f$, which reflects the emotion expressed by the facial expression, is the model's final output after it has gone through the transformer encoder layers15,22. To predict the facial emotion (such as happy, sad, angry, etc.), this feature vector is subsequently passed through a classification head, which is typically a SoftMax layer (Eq. 4):

$$\hat{y} = \operatorname{softmax}(W f + b) \tag{4}$$

where $\hat{y}$ is the predicted emotional state based on the facial expression, W is the weight matrix, and b is the bias term.
- Example of Vision Transformers in the proposed system: Facial expression recognition is essential in the Emotion-Aware AI Educational System for determining whether a student is frustrated or engaged34,35. For instance:
- Happy expression: If the student's facial expression is classified as happy, the system infers that the student is likely engaged with the content and proceeds to deliver increasingly challenging material.
- Frustrated expression: If the facial expression is classified as frustrated, the system recognises possible emotional distress and may intervene, for example by simplifying the material or providing supportive feedback.
Thus, the ViT model’s predictions of facial emotions are incorporated into the system’s overall emotion-aware feedback loop, which adjusts learning content in real time in response to the student’s emotional state.
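To make Eqs. (1)–(4) concrete, the simplified PyTorch sketch below shows patch embedding, learned positional encodings, transformer encoding, and a SoftMax head. It is a toy illustration under stated assumptions: mean pooling replaces the usual [CLS] token, the class name `TinyViT` is hypothetical, and all dimensions are illustrative rather than the configuration used in this work.

```python
# Simplified ViT sketch for Eqs. (1)-(4); dimensions and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3,
                 dim=256, depth=4, heads=8, num_emotions=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        self.embed = nn.Linear(channels * patch_size ** 2, dim)        # Eq. (1): e_i = E x_i
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))      # Eq. (2): z_i = e_i + PE_i
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # self-attention, Eq. (3)
        self.head = nn.Linear(dim, num_emotions)                       # Eq. (4): softmax(W f + b)

    def forward(self, images):
        b, c, h, w = images.shape
        p = self.patch_size
        # Split the image into non-overlapping P x P patches and flatten each patch.
        patches = images.unfold(2, p, p).unfold(3, p, p)               # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.embed(patches) + self.pos
        features = self.encoder(tokens).mean(dim=1)                    # pooled feature vector f
        return torch.softmax(self.head(features), dim=-1)              # emotion probabilities

probs = TinyViT()(torch.randn(2, 3, 224, 224))                         # shape: (2, 8)
```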
Role of ViTs in facial expression recognition
Facial expression recognition is essential for detecting the emotional states of students during their interactions with educational AI systems. A student's facial cues, including the movement of the mouth, eyebrows, and eyes, provide valuable insight into whether they are feeling joyful, bewildered, frustrated, focused, or bored. In the proposed system, Vision Transformers are deployed to analyse images of the student's face and classify their emotional expression5,19. Vision Transformers are selected because of their ability to capture long-range dependencies across the entire image, which is crucial for interpreting the delicate and intricate facial expressions that signal different emotions16. This ability to model global relationships is particularly valuable for reading the nuanced nonverbal cues that reveal how a student truly feels while engaging with an AI teaching platform (Fig. 2).
Fig. 2.

Vision transformer model working for facial expression recognition in the proposed system.
BERT-based models for text sentiment analysis
The Emotion-Aware AI Educational System utilises BERT-based models for text sentiment analysis, enabling the assessment of the emotional tone in student-provided text, including feedback, responses, or queries. Text sentiment analysis enables the system to assess student emotions such as engagement, frustration, happiness, or confusion from their text input, which is essential for providing personalised and emotionally attuned feedback18–20,28,32.
Mathematical framework of BERT-based models for text sentiment analysis
The BERT model, or Bidirectional Encoder Representations from Transformers, represents a leading approach in transformer-based natural language understanding. BERT effectively identifies the emotional tone present in a specific text during sentiment analysis. The model's capacity to comprehend contextual relationships among words in a sentence renders it especially effective for sentiment analysis tasks within the Emotion-Aware AI Educational System, where grasping the emotional content of a student's text, such as feedback, questions, or responses, is crucial. BERT utilises the transformer architecture, employing self-attention mechanisms and feed-forward neural networks to process text in parallel rather than sequentially23,24. This section outlines the mathematical framework that supports the BERT-based model for sentiment analysis.
- Tokenisation and input representation

Input: A sentence S made up of M words or tokens (Eq. 5):

$$S = \{w_1, w_2, \ldots, w_M\} \tag{5}$$

- A tokenizer (such as WordPiece) splits the input into sub-word or word tokens.
- BERT uses special tokens:
- [CLS]: Added at the start of the sequence for classification tasks.
- [SEP]: Used at the end of a sentence or to separate multiple sentences.
For example, the tokenised input for “I love this course!” would be:
Input sequence: [CLS] I love this course [SEP].
Each token $w_i$ is represented as a vector of size D (the embedding dimension). Embedding layer: each token is embedded into a fixed-dimension vector of size D, computed as the sum of three parts:

- Token embedding $E_{\text{tok}}(w_i)$: a learned embedding for each token.
- Positional embedding $E_{\text{pos}}(i)$: a learned embedding that retains the position of each token in the sequence.
- Segment embedding $E_{\text{seg}}(i)$: used to distinguish between sentences.

The input embedding for a token $w_i$ is computed as (Eq. 6):

$$E_i = E_{\text{tok}}(w_i) + E_{\text{pos}}(i) + E_{\text{seg}}(i) \tag{6}$$
The input embeddings for the entire sequence S are then passed to the BERT model.
- Transformer encoder layer: BERT processes input embeddings through transformer encoder layers. A transformer encoder layer includes two key operations:
- Self-attention: The attention mechanism enables the model to assess the importance of each token in relation to the others while taking into account the entire context of the sentence.
- Feed-forward network: Following self-attention, the output is fed into a feed-forward neural network for additional processing.
The self-attention mechanism uses the query, key, and value vectors obtained from the input embeddings to calculate attention scores. Equation (7) gives the attention score $\alpha_{ij}$ between tokens $w_i$ and $w_j$:

$$\alpha_{ij} = \frac{\exp(q_i \cdot k_j)}{\sum_{n=1}^{N} \exp(q_i \cdot k_n)} \tag{7}$$

where $q_i$ is the query vector of token $w_i$, $k_j$ is the key vector of token $w_j$, and N is the number of tokens in the input sequence.
The attention mechanism then produces a weighted sum of the value vectors, as given by Eq. (8):

$$z_i = \sum_{j=1}^{N} \alpha_{ij}\, v_j \tag{8}$$

where $v_j$ is the value vector of token $w_j$. Following the self-attention step, the output is passed through a feed-forward neural network with activation functions (such as ReLU) and then normalised.
Feature extraction

The final [CLS] token serves as the sentence's representation in sentiment analysis. After the transformer layers have been applied, the [CLS] token's output embedding is extracted and used as the classification feature vector (Eq. 9):

$$f = h^{(L)}_{[\text{CLS}]} \tag{9}$$

where $h^{(L)}_{[\text{CLS}]}$ denotes the output embedding of the [CLS] token after the final (L-th) transformer layer. Alternatively, feature extraction can involve pooling across all token embeddings, but the [CLS] token embedding is commonly used for classification tasks.
Sentiment classification

The final feature vector of the [CLS] token, $f$, is used to predict sentiment through a classification layer. The sentiment prediction is made using a SoftMax function, as presented in Eq. (10):

$$\hat{y} = \operatorname{softmax}(W_s f + b_s) \tag{10}$$

where $W_s$ is the weight matrix of the classifier, $b_s$ is the bias term, and $\hat{y}$ is the predicted sentiment class (e.g., Positive, Negative, Neutral).
The BERT model, which stands for Bidirectional Encoder Representations from Transformers, is a transformer-based architecture that effectively comprehends contextual relationships among words within a sentence. BERT undergoes pre-training on extensive text data and is subsequently fine-tuned for specific tasks, such as sentiment analysis, to identify sentiments (positive, negative, neutral) within text (Fig. 3).
Fig. 3.

BERT architecture with its sub-layers.
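The hedged sketch below shows how Eqs. (5)–(10) map onto the Hugging Face Transformers API: the tokenizer adds the [CLS] and [SEP] tokens, the [CLS] embedding from the final layer is taken as the sentence feature, and an untrained, illustrative linear layer with SoftMax produces a sentiment distribution. The checkpoint name and the three-class label set are assumptions, not the study's exact configuration.

```python
# Hedged sketch of Eqs. (5)-(10) using the Hugging Face Transformers library.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 3)   # Positive / Negative / Neutral

text = "I love this course!"
inputs = tokenizer(text, return_tensors="pt",
                   padding=True, truncation=True, max_length=64)   # adds [CLS] ... [SEP]

with torch.no_grad():
    outputs = bert(**inputs)

# Eq. (9): take the [CLS] token embedding from the final layer as the sentence feature.
cls_embedding = outputs.last_hidden_state[:, 0, :]

# Eq. (10): SoftMax over the classification layer (untrained here, so outputs are arbitrary).
probs = torch.softmax(classifier(cls_embedding), dim=-1)
print(probs.shape)   # torch.Size([1, 3]) -> distribution over the three sentiment classes
```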
Temporal convolutional networks for speech emotion recognition
TCNs are used to recognise speech emotions in the Emotion-Aware AI Educational System. TCNs analyse audio data to identify emotions based on the speech’s tone, pitch, speed, and other acoustic characteristics, much like ViTs do for facial expression recognition. Speech is essential for determining a student’s emotional state, which is necessary for dynamically modifying the educational process21,24,25,33,41. When modelling sequential data, like time-series audio signals, where the temporal dependencies in the speech data are crucial for emotion detection, TCNs are especially well-suited.
Architecture of TCN for speech emotion recognition
The TCN for speech emotion recognition has the following components (Fig. 4):
Input layer: The input is an audio signal, which can be a raw waveform or a feature-extracted representation (such as MFCCs or spectrograms)2,10.
- Convolutional layer (with dilation):
- Dilated convolutions are used to capture temporal dependencies at various time scales.
- The dilation factor broadens the receptive field of the convolutions, allowing the network to capture longer-term dependencies without increasing the number of layers.
- The convolution kernel is applied to the temporal dimension of an audio signal.
Causal convolutions: These ensure that the convolution uses only information from previous time steps, which is important for speech because the current state should depend only on past and present information.
Residual connections: These connections aid in training deeper networks and prevent vanishing/exploding gradient issues.
Fully connected layer: After passing through the convolutional layers, the output is routed through one or more fully connected layers to determine the emotional label.
Output layer: The final layer is a SoftMax classifier, which predicts the emotional state using the TCN's extracted features7.
Fig. 4.

Architecture of TCN for speech emotion recognition.
Role of TCNs in speech emotion recognition
TCNs are a subset of CNNs that are more effective and resilient than conventional RNNs at handling temporal patterns and sequential data. By concentrating on the speech’s temporal structure, TCNs are utilised to extract significant features from the unprocessed audio signal in the context of speech emotion recognition31,32. The key features of TCNs are as follows4,32.
Causal convolutions: These convolutions make sure that when the network predicts the output, it only uses historical data—not future data.
Dilated convolutions: Without needing many layers, these convolutions enable the network to capture long-range temporal dependencies.
Stable training: Compared to conventional RNNs and LSTMs, TCNs are simpler to train and converge more quickly.
The TCN model can be applied to speech emotion recognition by analysing features like:
Pitch (the perceived frequency of vocalisation)
Timbre (the quality of vocal sound)
Speech rate (the velocity of speech delivery)
Intensity (the loudness of vocalisation)
Through the analysis of raw audio signals or feature-extracted data (e.g., MFCCs), TCNs can discern patterns associated with various emotional states such as happiness, sadness, anger, and neutrality.
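As a minimal illustration of these ideas, the sketch below stacks dilated causal convolution blocks with residual connections over MFCC-style input and ends in a SoftMax classifier. Channel sizes, the dilation schedule, the class name `TinyTCN`, and the number of emotion classes are illustrative assumptions rather than the trained configuration.

```python
# Minimal TCN-style sketch: dilated causal convolutions with residual connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        # Left-padding keeps the convolution causal: no future time steps are used.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.relu(self.conv(F.pad(x, (self.pad, 0))))
        return out + x                                 # residual connection

class TinyTCN(nn.Module):
    def __init__(self, n_features=39, channels=64, num_emotions=6):
        super().__init__()
        self.input_proj = nn.Conv1d(n_features, channels, kernel_size=1)
        # Exponentially growing dilations widen the receptive field without extra depth.
        self.blocks = nn.Sequential(*[CausalConvBlock(channels, dilation=2 ** i)
                                      for i in range(4)])
        self.head = nn.Linear(channels, num_emotions)

    def forward(self, features):                       # e.g. MFCC + deltas: (batch, 39, time)
        h = self.blocks(self.input_proj(features))
        return torch.softmax(self.head(h.mean(dim=-1)), dim=-1)

probs = TinyTCN()(torch.randn(2, 39, 300))             # (2, 6) emotion probabilities
```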
Graph neural networks for multimodal fusion
In the Emotion-Aware AI Educational System, GNNs facilitate multimodal fusion to synthesise and evaluate data from various modalities, including facial expressions, speech, and text. Multimodal fusion seeks to integrate various data types to develop a holistic comprehension of the student’s emotional condition. A GNN is a neural network explicitly engineered for processing graph-structured data. It can be utilised to elucidate the relationships and interdependencies among various modalities in multimodal emotion recognition systems9,26. Similar to how Vision Transformers facilitate facial expression recognition and temporal convolutional networks are utilised for speech emotion recognition, GNNs are adept at integrating various data types, specifically, distinct modalities such as facial expression, speech, and text into a cohesive representation that considers the interrelations among these modalities6.
Mathematical model of GNN for multimodal fusion
The following steps represent the Graph Neural Network's mathematical model12,19,42:
- Graph construction

Let the graph be represented as $G = (V, E)$, where $V$ is the set of nodes (representing modalities, e.g., facial expression, speech, and text) and $E$ is the set of edges (representing the relationships between modalities). Each node $v_i \in V$ has a feature vector $h_i$, which represents the modality-specific features (e.g., features from facial expressions, speech, or text).
- Message passing (node update)

Every node updates its features during message passing by collecting information from its neighbours. The updated feature $h_i^{(t+1)}$ for node $v_i$ at iteration $t+1$ is computed as presented by Eq. (11):

$$h_i^{(t+1)} = \text{UpdateFunction}\!\left(h_i^{(t)}, \{h_j^{(t)} : v_j \in \mathcal{N}(i)\}\right) \tag{11}$$

where $h_i^{(t)}$ is the feature vector of node $v_i$ at iteration $t$ and $\{h_j^{(t)} : v_j \in \mathcal{N}(i)\}$ represents the set of features of the neighbours of node $v_i$. The UpdateFunction is typically composed of an aggregation function (such as sum, mean, or max) followed by a neural network layer (for example, a fully connected layer).
- Aggregation of multimodal information

Upon completion of T iterations of message passing, the final feature vector for each node is obtained. The node feature $h_i^{(T)}$ encapsulates the multimodal information for modality i. The final multimodal representation $h_G$ is derived by consolidating the features of all nodes within the graph (Eq. 12):

$$h_G = \text{AggregateFunction}\!\left(\{h_i^{(T)} : v_i \in V\}\right) \tag{12}$$

where the final feature vectors of all nodes in the graph are combined by the AggregateFunction, which may be a straightforward concatenation or a pooling operation such as mean pooling or max pooling4.
- Emotion classification

To classify the emotional state, the final multimodal representation $h_G$ is passed through a fully connected layer (or a group of layers) and then a SoftMax activation (Eq. 13):

$$\hat{y} = \operatorname{softmax}(W h_G + b) \tag{13}$$

where W is the weight matrix of the classifier, b is the bias term, and $\hat{y}$ is the predicted emotion (e.g., Happy, Sad, Angry).
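A minimal PyTorch sketch of Eqs. (11)–(13) is given below: three modality nodes (face, speech, text) exchange mean-aggregated messages over a fully connected graph, the node features are mean-pooled into a graph representation, and a SoftMax layer predicts the emotion. The class name `ModalityFusionGNN`, the dimensions, the number of iterations, and the update network are illustrative assumptions, not the trained configuration.

```python
# Minimal sketch of graph-based multimodal fusion (Eqs. 11-13); sizes are illustrative.
import torch
import torch.nn as nn

class ModalityFusionGNN(nn.Module):
    def __init__(self, dim=256, num_emotions=6, iterations=2):
        super().__init__()
        self.iterations = iterations
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # UpdateFunction, Eq. (11)
        self.classifier = nn.Linear(dim, num_emotions)                   # classifier, Eq. (13)

    def forward(self, node_feats, adjacency):
        # node_feats: (num_nodes, dim); adjacency: (num_nodes, num_nodes) with zero diagonal.
        h = node_feats
        for _ in range(self.iterations):
            # Mean-aggregate neighbour features, then update each node (Eq. 11).
            messages = adjacency @ h / adjacency.sum(dim=1, keepdim=True)
            h = self.update(torch.cat([h, messages], dim=-1))
        h_graph = h.mean(dim=0)                                          # AggregateFunction, Eq. (12)
        return torch.softmax(self.classifier(h_graph), dim=-1)           # Eq. (13)

# Face, speech, and text nodes connected to each other (no self-loops).
features = torch.randn(3, 256)
adjacency = torch.ones(3, 3) - torch.eye(3)
emotion_probs = ModalityFusionGNN()(features, adjacency)                 # distribution over 6 emotions
```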
GNN architecture for multimodal fusion
A graph neural network used for multimodal fusion typically has the following components (Fig. 5):
Input representation: Every modality (facial expression, speech, and text) is represented by a node in a graph. The nodes’ attributes are based on the features of these modalities, such as emotion-specific features extracted from facial expressions, speech prosody, and sentiment from text42.
- Graph construction: A graph is constructed with the following properties10:
- Nodes represent modalities (such as facial expression, speech, and text).
- Edges represent the dependencies or relationships between modalities (for example, how facial expressions relate to speech tone or text sentiment).
- Message passing:
- Each node collects information from its neighbours (i.e., from different modalities). This allows the GNN to learn how each modality influences the others through emotional cues.
- For example, information about the emotional tone in the speech modality can affect the analysis of facial expressions in the video modality.
Feature aggregation: After several iterations of message passing, the node features are aggregated, and the final node representations are computed, capturing the combined information from all modalities.
Multimodal fusion: The aggregated node representations are fed through a fully connected layer or attention mechanism, which combines the data into a single multimodal feature representation.
Emotion classification: The fused multimodal representation is then fed through a classification layer (typically a SoftMax classifier) to predict the student's emotional state (e.g., happy, sad, angry).
Fig. 5.

Architecture of GNNs for multimodal fusion.
Role of GNNs in multimodal fusion
The challenge in multimodal fusion is to combine data from various modalities (such as visual, auditory, and textual inputs) into a cohesive representation while retaining each modality’s unique characteristics. Graph Neural Networks accomplish this by representing the data from each modality as a graph and learning the interactions and dependencies between them18. The key features of GNNs in the proposed model are as follows (Fig. 6).
Graph representation: In a graph, each modality can be shown as a node, and the connections between them can be shown as edges. For example, the connections between facial expressions and speech tone are edges19.
Message passing: GNNs gather information from neighbouring nodes and use a message-passing scheme to keep node representations up to date. The model can then learn how the various modalities affect one another.
Flexibility: GNNs are well suited to integrating multimodal data because of their capacity to handle irregular and complex data structures, which gives them an advantage over models that assume regular, grid-like inputs28.
Fig. 6.
The key role of GNNs in multimodal fusion.
Model training and hyperparameter tuning
The models were trained on their respective datasets using backpropagation and gradient descent. Optimisation was performed with the Adam optimiser, together with a learning rate scheduler that adjusts the learning rate during training to promote good convergence. Training was carried out in batches with a suitable batch size (e.g., 32 or 64), and early stopping was used to avoid overfitting1,33,43. The emotion classification task used a cross-entropy loss function, which is appropriate for multi-class classification problems. For each training iteration, the model's output was compared with the true emotional labels and the loss was calculated as (Eq. 14):

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c) \tag{14}$$

where $y_c$ is the ground-truth label for class c, $\hat{y}_c$ is the predicted probability for class c, and C is the number of possible classes (e.g., Happy, Sad, Angry). The models were trained for a predetermined number of epochs (e.g., 50 or 100), and the validation loss was monitored throughout to ensure that the model did not overfit the training data32. Table 3 presents the summary of the hyperparameters used for ViTs, TCNs, and BERT-based models.
Table 3.
Hyperparameter details used for ViTs, TCNs, and BERT-based models.
| Hyperparameter | Vision transformers | Temporal convolutional networks | BERT-based models |
|---|---|---|---|
| Learning rate | 1e−4 | 1e−4 | 1e−5 |
| Batch size | 64 | 64 | 32 |
| Number of layers | 12 | 5 | 12 (BERT-base) |
| Dropout rate | 0.3 | 0.3 | 0.1 |
| Kernel size | N/A | 5 | N/A |
| Number of attention heads | 12 | N/A | 8 |
| Activation function | GELU | ReLU | GELU |
| Regularisation (L2) | 1e−4 | 1e−4 | 1e−4 |
| Optimizer | AdamW | AdamW | AdamW |
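The following hedged sketch outlines a training loop consistent with the procedure above: weighted cross-entropy loss (Eq. 14), the AdamW optimiser with a learning-rate scheduler, and early stopping on the validation loss. The `model`, the data loaders, and `class_weights` (a 1-D tensor of per-class weights) are assumed to exist, and the hyperparameter values are placeholders rather than the exact settings in Table 3.

```python
# Hedged sketch of the training loop; names and hyperparameters are placeholders.
import torch

def train(model, train_loader, val_loader, class_weights, epochs=50, patience=5, lr=1e-4):
    criterion = torch.nn.CrossEntropyLoss(weight=class_weights)      # Eq. (14), class-weighted
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)
    best_val, stale = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)                   # model returns raw class logits
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)
        scheduler.step(val_loss)                                      # adapt the learning rate
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:                                     # early stopping
                break
    return model
```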
Model performance measuring parameters
This section outlines the performance metrics and evaluation parameters utilised to assess the effectiveness of emotion recognition models, including Vision Transformers, Temporal Convolutional Networks, and BERT-based models, in the context of the AI-driven educational system. The selected metrics aim to assess model accuracy, robustness, and their overall influence on the learning experience13,14,23,29,43.
Evaluation metrics
- Accuracy: Accuracy quantifies the proportion of correct predictions made by the model relative to the total number of predictions. This serves as a crucial measure of the model's effectiveness in classifying students' emotional states (Eq. 15).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

- Precision: Precision quantifies the ratio of true positive predictions to the total number of positive predictions made by the model. The significance of this increases when the cost of false positives, such as misclassifying emotions, is substantial (Eq. 16).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

- Recall: Recall measures the ratio of true positive predictions to the total number of actual positive instances. Detecting emotions is crucial, as failing to identify them (false negatives) can greatly affect the model's effectiveness (Eq. 17).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

- F1 Score: The F1 score represents the harmonic mean of precision and recall. This metric offers a comprehensive assessment of model performance in scenarios with class imbalance (Eq. 18).

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$

Confusion matrix: The confusion matrix offers a comprehensive analysis of the model's predictions, detailing true positives, false positives, true negatives, and false negatives for each class.
ROC curve and AUC (area under the curve): The ROC curve illustrates the relationship between the true positive rate (recall) and the false positive rate, while the AUC measures the model’s overall capacity to differentiate between classes, with a higher AUC signifying superior performance.
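These metrics can be computed directly with scikit-learn, as in the short illustration below; the label and probability arrays are placeholders, and macro averaging is one reasonable choice for multi-class emotion recognition rather than a setting mandated by the study.

```python
# Illustrative computation of Eqs. (15)-(18), the confusion matrix, and ROC-AUC.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true = np.array([0, 1, 2, 2, 1, 0])                 # ground-truth emotion classes (placeholder)
y_pred = np.array([0, 1, 2, 1, 1, 0])                 # predicted emotion classes (placeholder)
y_score = np.array([[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7],
                    [0.2, 0.5, 0.3], [0.2, 0.6, 0.2], [0.6, 0.3, 0.1]])  # class probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))                        # Eq. (15)
print("Precision:", precision_score(y_true, y_pred, average="macro"))      # Eq. (16)
print("Recall   :", recall_score(y_true, y_pred, average="macro"))         # Eq. (17)
print("F1 score :", f1_score(y_true, y_pred, average="macro"))             # Eq. (18)
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Macro ROC-AUC:", roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))
```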
Batch processing and feedback evaluation
The system delivers feedback derived from batch-processed emotional data in the absence of real-time processing. The subsequent parameters assist in assessing the efficacy of the system’s emotional feedback2–5.
Student engagement: Student engagement denotes the degree of interaction and participation demonstrated by students during the learning process. Metrics such as time spent on tasks, the number of tasks completed, and interactions with the system are used for measurement. Engagement serves as a measure of the effectiveness of emotional feedback in sustaining or enhancing student interest in learning activities.
- Task completion rate: The task completion rate quantifies the ratio of tasks completed by students relative to the total tasks assigned (Eq. 19).

$$\text{Task completion rate} = \frac{\text{Number of tasks completed}}{\text{Total number of tasks assigned}} \times 100\% \tag{19}$$
Experimental results and discussion
In this section, we present the results of experiments performed on the AffectNet and IEMOCAP datasets. We conduct experiments on multimodal emotion recognition with a model comprising ViTs for facial expression recognition, TCNs for speech emotion recognition, and BERT-based representations for text sentiment analysis. Moreover, we compare our results with previously proposed models for emotion recognition tasks, including CNNs for facial expression recognition, RNNs for speech emotion recognition, and BERT-based models for text sentiment analysis. The findings are discussed in relation to the research questions, hypotheses, and performance assessment measures identified above.
Simulation setup and details
To ensure reliable performance and accurate results, the proposed emotion-aware AI system was simulated using both hardware and software tools.
Hardware setup
The experiments used a high-performance GPU (NVIDIA RTX 3090) to speed up model training and inference, especially for deep learning components like ViTs, TCNs, BERT-based models, and GNNs1,2,5,6. The GPU significantly reduced training times and allowed for the efficient processing of large datasets. The system also included 32 GB of RAM and an Intel Core i9 processor to handle computationally intensive tasks like training large models and managing multimodal data streams17,23.
Software setup
The model was built using Python 3.8 and various deep learning frameworks, including the following3:
TensorFlow and Keras were used to build and train models, particularly those based on ViT, TCN, and BERT. Keras provided a simple interface for model development and fine-tuning, while TensorFlow enabled GPU-based distributed training3,4.
PyTorch was also used to build GNNs and perform multimodal fusion. PyTorch's dynamic computation graph is well suited to complex models such as GNNs.
Hugging Face’s Transformers library was used to implement and fine-tune BERT-based models for text sentiment analysis7,8.
In addition, the AffectNet and IEMOCAP datasets were pre-processed and stored locally, with a custom data pipeline created to handle data augmentation, tokenisation, and feature extraction for facial, speech, and text data. For evaluation and testing, performance metrics such as precision, recall, F1 score, and accuracy were calculated using scikit-learn. The entire system was built on a Linux operating system (Ubuntu 20.04) to ensure stability and compatibility with deep learning libraries. To ensure efficient training and evaluation, the model was trained using the Adam optimiser with a learning rate of 0.001 and a batch size of 32.
Real-time performance considerations
Although the adopted model includes computationally expensive components such as TCNs, BERT, and GNNs, the system has been designed with a focus on emotion recognition performance9,10. Real-time performance, however, remains an open issue. Because the present investigation does not involve real-time interaction, future work will focus on optimising the trained model for real-time deployment and on verifying that the resulting system also performs well in online interaction tasks.
Data set splitting
This study carefully partitioned the datasets to support the training, validation, and testing phases used to assess the efficacy of the proposed emotion-aware AI system. Partitioning allows the model to be trained on one data subset, validated on a distinct subset during training, and finally assessed on an unseen test set. The AffectNet and IEMOCAP datasets were partitioned according to standard machine learning protocols to mitigate overfitting and support effective generalisation18,27. We employed the conventional 80–10–10 division: 80% of the data was allocated for model training, 10% was set aside for validation, and the remaining 10% was designated as the test set for the final evaluation. This procedure guarantees that the test set remains entirely unseen during training, facilitating a fair assessment of model efficacy.
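One common way to realise such an 80–10–10 split is with two successive stratified calls to scikit-learn's train_test_split, as sketched below; the feature matrix, labels, and random seed are placeholders, and stratification is an illustrative choice to keep emotion classes balanced across splits.

```python
# Minimal sketch of an 80-10-10 split; X and y stand in for the preprocessed data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 128)              # placeholder feature matrix
y = np.random.randint(0, 8, size=1000)     # placeholder emotion labels (8 classes)

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)                   # 80% training
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)    # 10% validation / 10% test
```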
Simulation results
Results on AffectNet dataset (facial expression recognition)
Figures 7 and 8 present a visual comparison of the proposed model with CNN and ResNet baselines on the AffectNet dataset. Figure 7 illustrates Accuracy and Precision, while Fig. 8 reports Recall and F1-score across all emotion categories. As evidenced by both figures and the numerical results in Table 4, the proposed ViTs model consistently outperforms existing approaches across all evaluation metrics, demonstrating improved robustness and balanced performance in facial expression recognition.
Fig. 7.
Comparison graph for proposed versus existing models on AffectNet dataset (accuracy and precision).
Fig. 8.
Comparison graph for proposed versus existing models on AffectNet dataset (recall and F1-score).
Table 4.
Results on AffectNet dataset for facial expression recognition (%).
| Emotion | Metric | Proposed model (ViTs) | Existing model (CNN) | Existing model (ResNet) |
|---|---|---|---|---|
| Happy | Precision | 96.34 | 88.67 | 90.51 |
| | Recall | 95.45 | 87.56 | 88.33 |
| | F1-score | 95.12 | 87.98 | 89.12 |
| | Accuracy | 95.23 | 85.67 | 87.94 |
| Sad | Precision | 94.58 | 84.89 | 85.12 |
| | Recall | 92.77 | 82.55 | 83.09 |
| | F1-score | 93.24 | 83.67 | 84.45 |
| | Accuracy | 94.45 | 80.91 | 82.34 |
| Angry | Precision | 91.34 | 80.23 | 82.19 |
| | Recall | 89.65 | 78.45 | 80.12 |
| | F1-score | 90.12 | 79.22 | 81.01 |
| | Accuracy | 93.56 | 82.45 | 84.11 |
| Surprise | Precision | 94.23 | 86.34 | 89.11 |
| | Recall | 93.47 | 85.21 | 87.65 |
| | F1-score | 94.12 | 85.34 | 88.04 |
| | Accuracy | 94.32 | 83.21 | 86.78 |
| Fear | Precision | 90.56 | 80.22 | 83.14 |
| | Recall | 89.43 | 78.31 | 81.34 |
| | F1-score | 89.68 | 79.12 | 82.05 |
| | Accuracy | 90.34 | 79.76 | 81.92 |
| Disgust | Precision | 91.22 | 81.56 | 83.14 |
| | Recall | 90.21 | 79.43 | 80.25 |
| | F1-score | 90.32 | 80.12 | 81.01 |
| | Accuracy | 91.57 | 80.84 | 82.13 |
| Neutral | Precision | 96.56 | 89.12 | 91.34 |
| | Recall | 95.43 | 87.67 | 89.11 |
| | F1-score | 95.67 | 88.14 | 90.01 |
| | Accuracy | 95.62 | 88.74 | 90.76 |
| Contempt | Precision | 88.12 | 79.35 | 81.11 |
| | Recall | 87.22 | 77.58 | 79.12 |
| | F1-score | 87.31 | 78.22 | 80.14 |
| | Accuracy | 87.42 | 78.03 | 80.31 |
Results on IEMOCAP dataset (speech emotion recognition)
This section assesses the efficacy of the proposed TCN model against RNN and LSTM baselines on the IEMOCAP dataset for speech emotion recognition. The models were evaluated using Precision, Recall, F1-score, and Accuracy across the emotions Anger, Happiness, Sadness, Surprise, Disgust, and Neutral.
Table 5 and Figs. 9 and 10 display the performance metrics (Precision, Recall, F1-score, and Accuracy) for three emotion recognition models, TCNs, RNN, and LSTM, across the emotional states Anger, Happiness, Sadness, Surprise, Disgust, and Neutral. These results underscore the ability of each model to recognise emotions from speech data. The TCN model typically surpasses the RNN and LSTM models, attaining superior Precision, Recall, F1-score, and Accuracy for the majority of emotions; it attained 92.34% precision for Anger and 95.67% precision for Happiness, demonstrating its robust capability to discern emotional states from speech. The RNN model, although effective, exhibits diminished performance, particularly regarding accuracy and recall, achieving 82.11% accuracy for Anger and 87.43% for Happiness. The LSTM model, while superior to the RNN, remains inferior to the TCN in many instances, especially for Anger, where it attained an accuracy of 84.11% compared with the TCN's 93.76%; nonetheless, it performs strongly for Happiness with an accuracy of 91.56%. Overall, the TCN model demonstrates superior performance across nearly all metrics, indicating that Temporal Convolutional Networks are exceptionally adept at capturing long-term temporal dependencies in speech emotion recognition tasks (a generic sketch of the dilated causal convolution block that a TCN stacks is given after Fig. 10).
Table 5.
Results on IEMOCAP dataset for speech emotion recognition (%).
| Emotion | Precision (TCNs) | Recall (TCNs) | F1-Score (TCNs) | Accuracy (TCNs) | Precision (RNN) | Recall (RNN) | F1-Score (RNN) | Accuracy (RNN) | Precision (LSTM) | Recall (LSTM) | F1-score (LSTM) | Accuracy (LSTM) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anger | 92.34 | 90.45 | 91.12 | 93.76 | 84.56 | 82.14 | 83.23 | 82.11 | 86.23 | 84.42 | 85.34 | 84.11 |
| Happiness | 95.67 | 94.58 | 94.89 | 95.44 | 89.21 | 87.32 | 88.13 | 87.43 | 91.14 | 89.44 | 90.21 | 91.56 |
| Sadness | 90.82 | 88.23 | 89.74 | 90.23 | 80.11 | 78.56 | 79.34 | 80.12 | 82.45 | 80.34 | 81.21 | 82.78 |
| Surprise | 92.21 | 91.76 | 92.56 | 93.32 | 84.89 | 82.67 | 83.45 | 82.21 | 86.78 | 84.12 | 85.09 | 84.89 |
| Disgust | 89.33 | 87.25 | 88.46 | 89.99 | 81.67 | 79.43 | 80.55 | 81.21 | 82.89 | 80.12 | 81.67 | 82.34 |
| Neutral | 94.76 | 93.67 | 93.85 | 94.78 | 88.98 | 86.72 | 87.56 | 88.44 | 90.12 | 88.88 | 89.11 | 90.76 |
Fig. 9.
Comparative analysis of simulation results (precision and recall) on the IEMOCAP dataset for speech emotion recognition using the proposed and existing models.
Fig. 10.
Comparative analysis of simulation results (accuracy and F1-score) on the IEMOCAP dataset for speech emotion recognition using the proposed and existing models.
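For context, a TCN stacks dilated causal 1-D convolutions so that its receptive field grows with depth while never looking ahead in time. The block below is a generic PyTorch illustration of that idea; the channel count, kernel size, and dilation schedule are hypothetical and not the configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """One dilated causal convolution with a residual connection,
    the basic unit that a TCN stacks with increasing dilation."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding only: causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                   # x: (batch, channels, time)
        out = F.pad(x, (self.pad, 0))                       # no look-ahead into the future
        out = F.relu(self.conv(out))
        return out + x                                      # residual connection

# Stacking blocks with dilations 1, 2, 4, 8 gives a receptive field of roughly
# 1 + (kernel_size - 1) * (1 + 2 + 4 + 8) time steps.
tcn = nn.Sequential(*[CausalConvBlock(64, 3, d) for d in (1, 2, 4, 8)])
```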
Results on multimodal fusion using GNNs
As the outcomes of multimodal fusion show (Table 6 and Fig. 11), the GNN model demonstrably surpasses the traditional fusion model in all metrics: Precision, Recall, F1-score, and Accuracy. For Happy emotion recognition, the GNN model attained 96.48% Precision, 95.70% Recall, 95.13% F1-score, and 95.26% Accuracy, whereas the traditional fusion model underperformed with 88.47% Precision, 87.29% Recall, 87.51% F1-score, and 86.38% Accuracy. This trend is uniform across all emotions, with the GNN model demonstrating significant enhancements in Precision, Recall, F1-score, and Accuracy, especially in identifying nuanced emotions such as Neutral (96.28% Precision and 95.00% Recall). The GNN model also surpassed the traditional fusion model on complex emotions such as Anger and Fear, attaining 92.29% Precision and 91.03% Recall for Anger, and 90.51% Precision and 89.02% Recall for Fear, whereas the traditional fusion model recorded 80.45% Precision and 78.71% Recall for Anger, and 80.36% Precision and 78.88% Recall for Fear. These findings confirm that GNNs are especially proficient at capturing multimodal emotional cues and subtleties, leading to enhanced recognition accuracy across a wide range of emotions.
Table 6.
Simulation results for multimodal fusion performance (%).
| Emotion | Precision (GNNs) | Recall (GNNs) | F1-Score (GNNs) | Accuracy (GNNs) | Precision (traditional fusion) | Recall (traditional fusion) | F1-score (traditional fusion) | Accuracy (traditional fusion) |
|---|---|---|---|---|---|---|---|---|
| Happy | 96.48 | 95.70 | 95.13 | 95.26 | 88.47 | 87.29 | 87.51 | 86.38 |
| Sad | 94.32 | 92.56 | 93.11 | 94.63 | 84.22 | 83.16 | 83.75 | 84.59 |
| Anger | 92.29 | 91.03 | 91.48 | 93.29 | 80.45 | 78.71 | 79.33 | 80.61 |
| Surprise | 94.42 | 93.72 | 94.09 | 94.37 | 87.14 | 85.52 | 86.21 | 85.94 |
| Fear | 90.51 | 89.02 | 89.64 | 90.76 | 80.36 | 78.88 | 79.52 | 80.43 |
| Disgust | 91.08 | 90.34 | 90.57 | 91.19 | 81.22 | 79.48 | 80.37 | 81.31 |
| Neutral | 96.28 | 95.00 | 95.17 | 96.11 | 89.19 | 87.67 | 88.30 | 88.74 |
Fig. 11.
Comparison of simulation results (accuracy and recall) for multimodal fusion performance.
Extended performance metrics
This section expands the model evaluation by integrating supplementary performance metrics in addition to the conventional precision, recall, F1-score, and accuracy. The extended metrics comprise area under the curve (AUC), receiver operating characteristic (ROC) curve, and confusion matrix. The incorporation of these metrics offers a more thorough comprehension of the model’s efficacy, particularly in distinguishing true positives, false positives, true negatives, and false negatives across diverse emotion categories. By analysing these comprehensive metrics, we can evaluate the models’ proficiency in accurately recognising emotions and reducing misclassifications, especially in complex or nuanced emotional expressions.
Confusion matrix
Confusion matrices for the AffectNet dataset (facial emotion recognition) and the IEMOCAP dataset (speech emotion recognition) were calculated on their respective test sets. AffectNet dataset: The confusion matrix for the AffectNet test set was computed over 40,000 facial images (Fig. 12). The proposed model (ViTs) performed strongly across all emotion classes, with especially high precision and recall for Happy, Neutral, and Sad. The diagonal cells of the confusion matrix represent correct predictions, whereas off-diagonal cells highlight misclassifications between similar emotions, such as Anger versus Sadness.
Fig. 12.
Confusion matrix for proposed model for AffectNet test dataset.
The confusion matrix for the IEMOCAP test set (which includes 16,000 sound frames) demonstrates how well the proposed model classified speech emotions (Fig. 13). The model performed well on emotions such as Happiness and Surprise, but showed some misclassifications between Sadness and Fear. The confusion matrix, scaled by a factor of 25, aids in visualising performance with larger sample counts, highlighting the model's accuracy and potential areas for improvement. These confusion matrices provide insights into the models' performance, revealing both their strengths in recognising specific emotions and the difficulties in distinguishing similar emotional expressions or tones.
Fig. 13.
Confusion matrix for proposed model for IEMOCAP test dataset.
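Confusion matrices such as those in Figs. 12 and 13 can be reproduced for any classifier output with scikit-learn; the snippet below is a generic sketch rather than our exact plotting code, and the label list simply mirrors the AffectNet categories.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

EMOTIONS = ["Happy", "Sad", "Angry", "Surprise", "Fear", "Disgust", "Neutral", "Contempt"]

def plot_confusion(y_true, y_pred, labels=EMOTIONS):
    """Row-normalised confusion matrix: diagonal cells are correct predictions,
    off-diagonal cells show which emotions are confused with one another."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    disp = ConfusionMatrixDisplay(cm, display_labels=labels)
    disp.plot(xticks_rotation=45, values_format=".2f")
    plt.tight_layout()
    plt.show()
```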
AUC ROC curve
The AUC ROC curve illustrates how well the Proposed Model distinguishes between various emotions in both the AffectNet (facial expression recognition) and IEMOCAP (speech emotion recognition) datasets. In the analysis:
AffectNet Dataset for Facial Expression Recognition: The ROC curve (Fig. 14) for the Proposed Model (ViTs) depicts the relationship between the True Positive Rate (TPR) and False Positive Rate (FPR) for emotion classification using facial expressions. High AUC values (close to one) indicate that the model accurately distinguishes emotions such as Happy, Neutral, and Sad. A curve that rises steeply towards the top-left corner, achieving a high TPR at a low FPR, corresponds to fewer misclassifications; an ideal curve reaches the point (FPR = 0, TPR = 1). Lower AUC values indicate that the model has difficulty distinguishing between specific classes.
IEMOCAP Dataset for Speech Emotion Recognition: Similarly, the ROC curve (Fig. 14) for the Proposed Model (TCNs) in speech emotion recognition reveals how well it can classify emotions based on audio cues. Higher AUC values indicate better emotion recognition, such as Happy, Surprised, and Neutral, and show that the model can effectively distinguish between speech tones. As with the AffectNet dataset, the ROC curve shows that the Proposed Model performs better when it achieves a high True Positive Rate while having a low False Positive Rate, indicating accurate emotion detection.
Fig. 14.
AUC ROC analysis of proposed model for both datasets.
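The per-class ROC curves and AUC values summarised in Fig. 14 follow the usual one-vs-rest treatment of a multi-class problem. A hedged sketch of that computation, assuming the classifier outputs per-class probabilities, is shown below.

```python
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

def per_class_roc(y_true, y_score, n_classes):
    """One-vs-rest ROC analysis for a multi-class emotion classifier.
    y_score holds predicted class probabilities, shape (n_samples, n_classes)."""
    y_bin = label_binarize(y_true, classes=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        results[c] = {"auc": roc_auc_score(y_bin[:, c], y_score[:, c]),
                      "curve": (fpr, tpr)}    # points for plotting the ROC curve
    return results
```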
Multimodal impact analysis
This section examines the impact of various modalities (facial expressions, speech, and text) on the overall efficacy of the emotion recognition system. The findings indicate that multimodal fusion, especially through the use of GNNs, markedly improves accuracy and robustness by integrating complementary emotional signals from various sources, resulting in superior emotion recognition performance relative to single-modality models.
Table 7 and Fig. 15 compare the precision achieved by different modalities: Facial Modality, Speech Modality, Text Modality, and Multimodal Fusion using GNNs for various emotional states: Happy, Sad, Anger, Surprise, Fear, Disgust, and Neutral. Multimodal Fusion with GNNs consistently outperforms each modality across all emotions. For example, in the Happy emotion, Multimodal Fusion with GNNs achieves 96.34% Precision, which is significantly higher than Speech (92.67%), Facial (91.34%), and Text Modality (88.12%). Similar trends are seen in other emotions such as sadness, anger, fear, and disgust, where the combination of facial, speech, and text modalities results in the highest Precision values. This demonstrates how combining multiple modalities using GNNs improves the model’s ability to detect complex emotional cues. Neutral emotion also performs well in Multimodal Fusion with GNNs, with a precision of 96.89%, compared to 90.45% for Facial Modality, 94.13% for Speech Modality, and 89.22% for Text Modality.
Table 7.
Comparative analysis for multimodal impact analysis (%).
| Emotion | Facial modality precision | Speech modality precision | Text modality precision | Multimodal fusion precision (GNNs) |
|---|---|---|---|---|
| Happy | 91.34 | 92.67 | 88.12 | 96.34 |
| Sad | 89.45 | 91.22 | 85.33 | 94.56 |
| Anger | 84.22 | 90.14 | 85.21 | 92.32 |
| Surprise | 86.17 | 91.39 | 84.56 | 94.12 |
| Fear | 81.22 | 89.43 | 86.31 | 90.14 |
| Disgust | 82.76 | 88.55 | 87.14 | 91.67 |
| Neutral | 90.45 | 94.13 | 89.22 | 96.89 |
Fig. 15.
Comparison of precision for different modalities and multimodal fusion (GNNs).
Analysis of misclassifications
In this section, we examine the misclassifications of various emotions in the emotion recognition model. Table 8 shows the misclassified emotion pairs, their respective accuracy, and the frequency of such misclassifications. These pairs are identified using the confusion matrix and represent cases in which the model frequently confuses two specific emotions (Table 8 and Fig. 16).
For example, the model most frequently misclassifies Anger as Disgust (91.32% accuracy, 8% frequency), suggesting that these two emotions are often confused, possibly because of similar facial expressions or emotional cues.
Another common misclassification is Fear versus Sadness, with an accuracy of 95.22% and a frequency of 6%. This implies that both emotions may use similar tonalities or expressions in speech, resulting in frequent confusion.
The model also incorrectly classifies Happy as Neutral (93.92% accuracy, 5% frequency) and Surprise as Fear (92.93% accuracy, 4% frequency). These errors highlight the difficulties in distinguishing between subtle emotional states in real-world scenarios, where emotions such as Happy and Neutral or Surprise and Fear have similar characteristics.
Table 8.
Comparative analysis of misclassified emotion pairs (%).
| Misclassified emotion pairs | Accuracy (%) | Frequency (%) |
|---|---|---|
| Anger versus Disgust | 91.32 | 8 |
| Fear versus Sadness | 95.22 | 6 |
| Happy versus Neutral | 93.92 | 5 |
| Surprise versus Fear | 92.93 | 4 |
Fig. 16.
Misclassification analysis for emotion recognition.
Task completion impact
This section discusses the AI system’s impact on task completion rates and student engagement across different emotional states. Table 9 and Fig. 17 compare task completion rates before and after the AI system was implemented, as well as increases in engagement.
The proposed AI system resulted in significant improvements in task completion rates across all emotions. For example, the happy emotion increased from 75.43 to 95.19%, resulting in a 20% increase in engagement. This suggests that the AI system significantly increased student engagement and task completion among students experiencing positive emotions.
Similarly, task completion rates for emotions such as Sadness and Anger improved from 70.42 to 85.83% (+15%) and from 65.74 to 80.93% (+15%), respectively. These findings indicate that the AI system is equally effective in helping learners overcome obstacles associated with negative emotions.
The Neutral emotion showed a +13% increase in task completion, from 81.34 to 93.07%, demonstrating the system's ability to engage students who might otherwise be less involved.
Table 9.
Task completion rates and engagement increase (%).
| Emotion | Task completion rate (Before AI system) | Task completion rate (After AI system) | Engagement increase (%) |
|---|---|---|---|
| Happy | 75.43 | 95.19 | 20 |
| Sad | 70.42 | 85.83 | 15 |
| Anger | 65.74 | 80.93 | 15 |
| Surprise | 77.49 | 92.63 | 15 |
| Fear | 60.51 | 75.96 | 15 |
| Disgust | 68.95 | 83.85 | 15 |
| Neutral | 81.34 | 93.07 | 13 |
Fig. 17.
Task completion impact analysis.
Robustness and generalisation
Table 10 and Fig. 18 compare the performance of the proposed model (across all groups) to that of existing models (across all groups) for a variety of emotional states. The table summarises the overall performance (in terms of accuracy) of each emotion across all test groups. The proposed model outperforms the existing models in terms of accuracy across all emotions. For example, the Proposed Model has a Happy emotion accuracy of 95.23%, which is significantly higher than the Existing Models’ (85.67%). Sad and Anger show a significant difference between the proposed model (94.45% and 93.56%, respectively) and the existing models (80.91% and 82.45%). These findings suggest that the Proposed Model is more robust and applicable to a wide range of emotional states. For Fear and Disgust, the Proposed Model achieves 90.34% and 91.57% accuracy, respectively, compared to the Existing Models’ 79.76% and 80.84% accuracy.
Table 10.
Robustness and generalisation analysis for the proposed model across all groups (%).
| Emotion | Proposed model (across all groups) | Existing models (across all groups) |
|---|---|---|
| Happy | 95.23 | 85.67 |
| Sad | 94.45 | 80.91 |
| Anger | 93.56 | 82.45 |
| Surprise | 94.32 | 83.21 |
| Fear | 90.34 | 79.76 |
| Disgust | 91.57 | 80.84 |
Fig. 18.
Robustness and generalisation analysis for proposed model (across all groups).
Statistical significance (t-tests and ANOVA)
Statistical tests help determine whether observed differences in model performance are due to chance or are statistically significant. To compare the performance of the proposed model with the existing models, we use t-tests (for two models) and ANOVA (for multiple models).
Paired t-test for comparison of precision, recall, and F1 scores
- Objective: We aim to ascertain whether the differences between the Proposed Model and the Existing Models (CNN, ResNet, RNN, LSTM) in the performance metrics (precision, recall, F1 score, and accuracy) are statistically significant.
- Null Hypothesis (H0): The performance of the Proposed Model and the Existing Models does not differ significantly.
- Alternative Hypothesis (H1): The performance of the proposed model is noticeably superior to that of the current models.
- Paired t-test for precision comparison:
- Group 1: The proposed model’s precision values (ViTs, TCNs, BERT, and GNN).
- Group 2: LSTM, CNN, ResNet, and RNN precision values.
We conduct a paired t-test to evaluate the precision of the happy emotion as presented in Table 11.
Table 11.
Results for paired t-test to evaluate the precision of the happy emotion.
| Emotion | Proposed model precision | Existing model precision | t-statistic | p value |
|---|---|---|---|---|
| Happy | 96.34% | 89.54% | 5.24 | 0.002 |
Since the p value (0.002) is lower than 0.05, we reject the null hypothesis and conclude that the Proposed Model significantly outperforms the Existing Model in terms of precision.
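The paired comparison can be reproduced with SciPy; the per-run precision values below are illustrative placeholders, not the actual measurements behind Table 11.

```python
from scipy import stats

# Hypothetical per-run precision values for the Happy class (placeholders).
proposed = [96.1, 96.4, 96.3, 96.5, 96.4]
existing = [89.2, 89.6, 89.4, 89.8, 89.7]

t_stat, p_value = stats.ttest_rel(proposed, existing)   # paired t-test
if p_value < 0.05:
    print(f"Reject H0: t = {t_stat:.2f}, p = {p_value:.4f}")
```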
Analysis of variance (ANOVA) for comparison of multiple models
- Objective: To determine whether the performance of the various models (Proposed Model, CNN, ResNet, RNN, LSTM) differs significantly, an ANOVA was conducted.
- Null Hypothesis (H0): There are no appreciable variations in the performance of any model.
- Alternative Hypothesis (H1): There are significant performance differences between the models.
Table 12 and Fig. 19 show the F1 Scores of the Proposed Model (with ViTs, TCNs, BERT, and GNN), CNN, ResNet, RNN, and LSTM for Happy and Sad emotions. The results show that the Proposed Model outperforms the existing models in terms of F1 Score for both emotions by a significant margin, as indicated by the ANOVA F-statistics and p values. A p value below 0.05 signifies that the disparity in F1 scores is statistically significant among the models (Table 12). In this instance, regarding the Happy emotion, we reject the null hypothesis and determine that the Proposed Model significantly surpasses the others in F1 score.
Table 12.
ANOVA results for F1 scores.
| Emotion | Proposed model (ViTs, TCNs, BERT, GNN) | CNN | ResNet | RNN | LSTM | F-statistic | p value |
|---|---|---|---|---|---|---|---|
| Happy | 0.95 | 0.87 | 0.89 | 0.85 | 0.88 | 9.45 | 0.0002 |
| Sad | 0.93 | 0.83 | 0.84 | 0.81 | 0.82 | 8.13 | 0.0006 |
Fig. 19.
Comparison of F1 scores for different models on happy and sad emotions.
The outcomes of the t-tests and ANOVA (Table 13) indicate that the Proposed Model significantly surpasses the Existing Models across the performance metrics (precision, recall, F1 score, and accuracy) for all emotions. The p values for the Proposed Model (ViTs, TCNs, BERT, GNN) are substantially below the 0.05 threshold, signifying that these differences are statistically significant. The findings corroborate the hypothesis that sophisticated deep learning models, such as the Proposed Model, improve emotion recognition efficacy relative to conventional models.
Table 13.
Results from statistical testing for all emotions.
| Emotion | t-statistic (Precision) | p value (Precision) | F-statistic (F1) | p value (F1) |
|---|---|---|---|---|
| Happy | 5.24 | 0.002 | 9.45 | 0.0002 |
| Sad | 4.12 | 0.004 | 8.13 | 0.0006 |
| Anger | 3.45 | 0.008 | 7.91 | 0.0009 |
| Surprise | 4.59 | 0.003 | 8.65 | 0.0004 |
| Fear | 2.67 | 0.019 | 6.32 | 0.003 |
| Disgust | 3.12 | 0.011 | 7.47 | 0.0011 |
| Neutral | 5.15 | 0.001 | 9.78 | 0.0001 |
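The one-way ANOVA behind Tables 12 and 13 can likewise be run with SciPy; the per-fold F1 samples here are placeholders used only to show the call.

```python
from scipy import stats

# Hypothetical per-fold F1 scores for the Happy emotion (placeholders).
f1_scores = {
    "proposed": [0.95, 0.96, 0.95, 0.95, 0.96],
    "cnn":      [0.87, 0.86, 0.88, 0.87, 0.87],
    "resnet":   [0.89, 0.88, 0.90, 0.89, 0.89],
    "rnn":      [0.85, 0.84, 0.86, 0.85, 0.85],
    "lstm":     [0.88, 0.87, 0.89, 0.88, 0.88],
}

f_stat, p_value = stats.f_oneway(*f1_scores.values())   # one-way ANOVA across models
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")            # p < 0.05 -> reject H0
```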
k-fold cross-validation analysis
This section reports the results of k-fold cross-validation (k = 5) used to test the robustness and generalisability of the proposed emotion-aware AI model. Evaluating a model's performance reliably is one of the most important steps in machine learning; cross-validation, as a routine method, divides the dataset into several parts for repeated training and evaluation. The dataset was divided into five equal subsets, and the model was trained and tested five times, each time holding out a different fold for testing. This guarantees that every data point is used for both training and testing, yielding a more comprehensive evaluation. For this analysis, the proposed model, which combines Vision Transformers (ViTs), Temporal Convolutional Networks (TCNs), BERT-based models, and Graph Neural Networks (GNNs) for multimodal fusion, was cross-validated, and the per-fold and mean Precision, Recall, F1-score, and Accuracy were computed. The results obtained over the folds for each emotion are shown in Table 14.
Table 14.
K-fold cross-validation results for the proposed model (K = 5, values in %).
| Emotion | Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
|---|---|---|---|---|---|---|---|
| Happy | Precision | 96.12 | 96.25 | 96.40 | 96.34 | 96.58 | 96.34 |
| | Recall | 95.20 | 95.39 | 95.50 | 95.45 | 95.60 | 95.43 |
| | F1-score | 95.66 | 95.81 | 95.95 | 95.87 | 96.01 | 95.86 |
| | Accuracy | 95.10 | 95.35 | 95.52 | 95.47 | 95.68 | 95.42 |
| Sad | Precision | 94.23 | 94.35 | 94.52 | 94.58 | 94.62 | 94.42 |
| | Recall | 92.57 | 92.67 | 92.82 | 92.77 | 92.84 | 92.73 |
| | F1-score | 93.40 | 93.51 | 93.67 | 93.64 | 93.73 | 93.59 |
| | Accuracy | 93.20 | 93.30 | 93.50 | 93.45 | 93.60 | 93.41 |
| Anger | Precision | 91.25 | 91.30 | 91.50 | 91.34 | 91.40 | 91.36 |
| | Recall | 89.70 | 89.81 | 89.93 | 89.65 | 89.78 | 89.77 |
| | F1-score | 90.47 | 90.56 | 90.71 | 90.50 | 90.59 | 90.57 |
| | Accuracy | 90.34 | 90.45 | 90.58 | 90.78 | 90.65 | 90.56 |
| Surprise | Precision | 93.15 | 93.22 | 93.40 | 93.23 | 93.28 | 93.26 |
| | Recall | 92.38 | 92.46 | 92.60 | 92.47 | 92.52 | 92.48 |
| | F1-score | 92.76 | 92.84 | 92.99 | 92.85 | 92.90 | 92.87 |
| | Accuracy | 92.51 | 92.56 | 92.69 | 92.62 | 92.68 | 92.61 |
| Neutral | Precision | 96.12 | 96.20 | 96.38 | 96.56 | 96.50 | 96.35 |
| | Recall | 95.31 | 95.40 | 95.47 | 95.43 | 95.50 | 95.42 |
| | F1-score | 95.71 | 95.80 | 95.85 | 95.88 | 95.94 | 95.83 |
| | Accuracy | 95.47 | 95.55 | 95.67 | 95.69 | 95.74 | 95.62 |
The stability and effectiveness of the proposed emotion-aware AI model are evident from Table 14, which shows consistently reliable performance across folds. For instance, in detecting the Happy emotion, the model obtained an average precision of 96.34%, a recall of 95.43%, and an accuracy of 95.42% over the five folds, a strong indication of stable performance, with Fold 5 reaching the highest accuracy of 95.68%. Similarly, for the Neutral emotion the model achieved the highest overall accuracy (95.62%), precision (96.35%), and recall (95.42%), indicating that it also captures more ambiguous emotional states effectively. Importantly, the model outperforms each single-modality component (such as ViTs or TCNs alone) on the Sad and Anger emotions, with average accuracies of 93.41% and 90.56%, respectively, showing that it behaves as a balanced and reliable methodology across all emotion classes.
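The cross-validation protocol can be expressed compactly with scikit-learn. The sketch below assumes in-memory NumPy arrays and a hypothetical `build_model` factory returning a fit/predict estimator; the stratified folds are our addition rather than part of the stated protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def cross_validate(build_model, features, labels, k=5):
    """Train a fresh model on each of k folds and average precision,
    recall, F1 and accuracy over the folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    fold_scores = []
    for train_idx, test_idx in skf.split(features, labels):
        model = build_model()                               # hypothetical factory
        model.fit(features[train_idx], labels[train_idx])
        preds = model.predict(features[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            labels[test_idx], preds, average="macro")
        fold_scores.append((p, r, f1, accuracy_score(labels[test_idx], preds)))
    return np.mean(fold_scores, axis=0)   # averaged (precision, recall, F1, accuracy)
```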
Ablation study on the proposed model
An ablation study was performed to analyse the contributions of individual components in the emotion-aware AI system. We systematically analyse how removing or replacing specific parts of the model, such as the Vision Transformers (ViTs), Temporal Convolutional Networks (TCNs), BERT-based models, and Graph Neural Networks (GNNs), affects performance. The contribution of each part is quantified by comparing the full model with versions lacking or simplifying a given component. The following configurations were considered in the ablation analysis.
Full model: ViTs, TCNs (TCN-32), BERT and GNNs used for multimodal fusion.
ViTs-only: Employs only ViTs for facial expression recognition and discards all the other modalities.
TCNs-only: uses only TCNs for speech emotion recognition and disregards visual and text inputs.
BERT-only: uses only textual input for sentiment analysis and disregards visual and speech inputs.
Fusion without GNNs: links the ViT, TCN, and BERT features by simple concatenation or dynamic weighting, without graph-based fusion (a minimal sketch of this baseline follows the list).
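As referenced in the last item, the following is a minimal sketch of the concatenation baseline; the embedding dimensions and classifier head are illustrative assumptions, not the exact configuration evaluated in Table 15.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """'Fusion without GNNs' baseline: concatenate per-modality embeddings and
    classify with a small fully connected head (dimensions are illustrative)."""
    def __init__(self, dim_face=768, dim_speech=256, dim_text=768, n_emotions=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(dim_face + dim_speech + dim_text, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_emotions),
        )

    def forward(self, face_emb, speech_emb, text_emb):
        fused = torch.cat([face_emb, speech_emb, text_emb], dim=-1)
        return self.head(fused)   # unnormalised emotion logits
```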
The results of the ablation study (Table 15) unambiguously indicate that the full model (ViTs + TCNs + BERT + GNN) provides superior or at least comparable performance for all emotions. The ViTs-only model, which relies solely on visual input for facial expression recognition, performs well for emotions such as Happy but struggles with more complex emotions such as Sadness and Neutral, demonstrating that vision alone is not sufficient for accurate emotion detection. The same trend is observed for the TCNs-only and BERT-only models, as restricting recognition to speech or text leads to lower accuracy for emotions that require multimodal cues. The fusion model without GNNs, which uses simple fusion such as concatenation or attention, still achieves slight gains over the individual modalities but falls short of the GNN-based fusion. For example, Neutral emotion recognition shows a significant accuracy increase with the complete GNN-based model (96.11%) compared with the simpler fusion strategy (88.74%).
Table 15.
Ablation study results for emotion recognition (%).
| Emotion | Model | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Happy | Full Model (ViTs + TCNs + BERT + GNN) | 96.34 | 95.45 | 95.12 | 95.19 |
| | ViTs-only | 88.67 | 87.56 | 87.98 | 87.67 |
| | TCNs-only | 80.11 | 79.56 | 79.83 | 80.15 |
| | BERT-only | 81.23 | 80.98 | 81.02 | 81.11 |
| | Fusion without GNNs | 88.47 | 87.29 | 87.51 | 86.38 |
| Sadness | Full Model (ViTs + TCNs + BERT + GNN) | 94.58 | 92.77 | 93.24 | 94.45 |
| | ViTs-only | 84.89 | 82.55 | 83.67 | 83.45 |
| | TCNs-only | 77.92 | 76.11 | 76.99 | 77.30 |
| | BERT-only | 79.56 | 78.23 | 78.89 | 79.11 |
| | Fusion without GNNs | 83.92 | 82.17 | 82.56 | 82.88 |
| Anger | Full Model (ViTs + TCNs + BERT + GNN) | 91.34 | 89.65 | 90.12 | 90.78 |
| | ViTs-only | 80.23 | 78.45 | 79.22 | 79.30 |
| | TCNs-only | 85.12 | 83.87 | 84.49 | 84.61 |
| | BERT-only | 77.45 | 75.92 | 76.17 | 76.35 |
| | Fusion without GNNs | 84.52 | 83.19 | 83.85 | 83.71 |
| Surprise | Full Model (ViTs + TCNs + BERT + GNN) | 94.23 | 93.47 | 94.12 | 94.28 |
| | ViTs-only | 86.34 | 85.21 | 85.34 | 85.42 |
| | TCNs-only | 80.76 | 79.34 | 79.88 | 79.92 |
| | BERT-only | 79.12 | 78.45 | 78.88 | 78.96 |
| | Fusion without GNNs | 87.13 | 85.87 | 86.00 | 85.75 |
| Neutral | Full Model (ViTs + TCNs + BERT + GNN) | 96.56 | 95.43 | 95.67 | 96.11 |
| | ViTs-only | 89.12 | 87.67 | 88.14 | 88.45 |
| | TCNs-only | 81.34 | 79.56 | 80.45 | 80.90 |
| | BERT-only | 82.23 | 80.79 | 81.51 | 81.11 |
| | Fusion without GNNs | 89.19 | 87.67 | 88.30 | 88.74 |
These results show that all components are critical. ViTs are important for analysing visual cues of emotion but are not sufficient for understanding complex emotions that also depend on speech and text. Likewise, TCNs enhance speech emotion recognition, and BERT helps interpret textual cues; however, neither can fully capture the multi-dimensional nature of emotions on its own. The GNN-based fusion is essential for integrating the three modalities and accurately modelling their complex interdependencies. This ability allows the model to perform more accurate and more robust emotion recognition, especially for subtle emotional information, demonstrating the importance of multimodal fusion with GNNs for the final performance.
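To illustrate the idea of graph-based fusion discussed above, the sketch below treats the three modality embeddings as nodes of a small fully connected graph, applies one round of attention-weighted message passing, and pools the updated node features for classification. It is a toy illustration under assumed dimensions, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGraphFusion(nn.Module):
    """Toy graph fusion over three modality nodes (face, speech, text):
    one round of attention-weighted message passing, then mean pooling."""
    def __init__(self, dim=256, n_emotions=8):
        super().__init__()
        self.msg = nn.Linear(dim, dim)        # message transform
        self.att = nn.Linear(2 * dim, 1)      # pairwise attention score
        self.cls = nn.Linear(dim, n_emotions)

    def forward(self, nodes):                 # nodes: (batch, 3, dim)
        b, n, d = nodes.shape
        receiver = nodes.unsqueeze(2).expand(b, n, n, d)   # node i in pair (i, j)
        sender = nodes.unsqueeze(1).expand(b, n, n, d)     # node j in pair (i, j)
        att = F.softmax(
            self.att(torch.cat([receiver, sender], dim=-1)).squeeze(-1), dim=-1)
        # Attention-weighted sum of transformed neighbour messages, plus residual.
        messages = torch.einsum("bij,bjd->bid", att, self.msg(nodes))
        updated = F.relu(nodes + messages)
        return self.cls(updated.mean(dim=1))  # pooled graph embedding -> logits
```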
Ethical considerations (GDPR & FERPA compliant)
When using public datasets such as AffectNet and IEMOCAP that contain emotional information about their subjects, it is crucial to consider the ethical aspects of handling such sensitive data, in particular privacy and data collection (user consent and data protection). Although these datasets are publicly available (through repositories such as Kaggle and GitHub), the legal and ethical obligations surrounding their use still apply. This section describes the relevant GDPR and FERPA principles and how the proposed emotion recognition system complies with them, focusing on privacy and data protection.
GDPR compliance
The GDPR is an EU regulation that governs the collection, retention, and processing of personal data. Although the AffectNet and IEMOCAP datasets are public, we acknowledge that ethical standards must still be maintained when applying any personal or emotional data in this study, as required by the GDPR. The following steps were taken to ensure compliance:
Data anonymisation: AffectNet and IEMOCAP have been de-identified by the dataset creators and therefore contain no personally identifiable information (PII) that could directly identify an individual (e.g., names or direct identifiers). Nonetheless, emotion recognition data remains sensitive because it captures personal emotional expressions. To protect privacy, no personal information is linked with the emotional information used in this study.
Data minimisation: in line with the GDPR's data minimisation principle, only the features required for emotion recognition are used. We consider only facial expressions, speech tone, and text sentiment, without collecting any unnecessary private information that could infringe on users' privacy.
Data protection: all data (including the AffectNet and IEMOCAP datasets) is stored in encrypted form. Access is restricted to authorised personnel under strict access control, and all data is kept on secure local servers that protect against unauthorised access or disclosure.
Participant consent: although the data used in this study are public and de-identified, the original participants provided written informed consent for the use of their data, and we rely on the permission granted by the dataset providers for research purposes. If the system were used to gather additional user data (e.g., in live training contexts), explicit consent would be requested from participants in accordance with GDPR guidelines.
Rights of access, rectification, and erasure: if the system were to collect data directly from users, they would be able to access, correct, or delete their data at any stage of processing, consistent with the GDPR's right to erasure.
FERPA compliance
The Family Educational Rights and Privacy Act (FERPA) safeguards the privacy of students' educational records in the USA. Although FERPA primarily targets educational institutions and how they manage student data, its principles of privacy and control are directly applicable to AI in education. For the present research, in which students' emotional states during learning are inferred with emotion recognition, we address the following issues.
Educational setting: although the AffectNet and IEMOCAP datasets were not collected in educational settings, if the system were deployed in education (e.g., a classroom setting), all emotion data would be treated as part of the student record. Pursuant to FERPA, any information gathered from students would be used solely for educational purposes (e.g., tailoring instruction according to students' emotional states), and explicit consent would be obtained from students or their guardians before any data collection.
Access to data: FERPA prohibits access to educational records by unauthorised people, so any emotional information obtained would be tightly controlled. Only personnel with a legitimate educational interest would have access, ensuring that students' emotional reactions are used solely to enrich the learning experience and never for adversarial purposes.
Privacy: FERPA emphasises privacy; hence, student data, including the emotional information used in this study, is treated with strict confidentiality. Any identifiable emotional data would be secured and disclosed only to trusted users, such as educators or administrators who use the information to adapt the learning environment to the sensed emotions.
Transparency and ethical use of AI
Emotion recognition data is sensitive, so AI systems that rely on it must use it ethically and operate transparently. This includes:
Bias and fairness: both the AffectNet and IEMOCAP datasets exhibit demographic biases (e.g., AffectNet contains an over-representation of Western faces). To counterbalance this bias, we continue to test and optimise the emotion recognition model for fairness across a wide range of populations (races, genders, and cultures). In addition, we use data augmentation and fairness-aware learning approaches to improve the model's generalisation across a heterogeneous set of learners.
Explainability: The AI solution used for emotion recognition should be transparent and provide a clear explanation of why it has detected a given emotional state. This is particularly important for educational settings, where decisions made with emotional data can impact student learning greatly. To make the decision process understandable to students and teachers, methods like explainable AI (XAI) are integrated.
Data minimisation and ethical safeguards: the system is designed to handle only the emotional data required to adapt the learning environment; no non-essential personal data is collected. Regular audits and safeguards verify that data use remains ethical and within privacy law.
In summary, this study follows the GDPR and FERPA regulations on user consent and data protection in the management of emotional data. Our ethos is privacy by design: through anonymisation, secure storage, and data minimisation we hold ourselves to the highest standards of ethical AI in education. In addition, we ensure that emotional information is used fairly and transparently and that bias is mitigated. These are the building blocks of a system that protects privacy, builds trust with users, and generates useful data to improve educational outcomes.
Conclusion and future directions
Conclusion
In this work, we have proposed a new emotion-aware AI system that integrates multimodal emotional intelligence (EI) to enrich learning. The proposed system implements state-of-the-art deep learning methodologies: Vision Transformers (ViTs) for facial expression recognition, Temporal Convolutional Networks (TCNs) for speech emotion recognition, and BERT-based approaches for text sentiment analysis in unimodal processing, together with Graph Neural Networks (GNNs) to integrate information across the modalities. This integration allows the model to assess and react to students' affective-cognitive states instantly, making the learning environment more dynamic and adaptive to learners' individual differences. To assess the performance of our system, we carried out extensive experiments on two publicly available and well-known datasets: AffectNet for facial expression recognition and IEMOCAP for speech emotion recognition. These datasets, which cover facial, speech, and textual emotional cues, offer a strong basis for measuring the system's ability to interpret the complex multimodal emotional data expressed in an educational environment.
Our experiments showed that the emotion-aware AI system outperforms traditional cognition-only AI models, surpassing RNN, LSTM, CNN, and ResNet baselines on several evaluation metrics. It achieved 96.34% precision for happiness and 96.56% precision for neutral emotions, and it produced a 15–20% improvement in student engagement, a 20–25% reduction in frustration, and a 15–20% increase in successful task completion compared with traditional AI systems that model only cognitive factors. What distinguishes our proposed system is the multimodal fusion through GNNs, which learns the complex relations among facial expressions, vocal tone, and textual sentiment. This enables a deeper understanding of emotion, yielding a marked increase in the accuracy and context-sensitivity of emotion detection. With GNNs, the system becomes more adaptable to various emotional cues, providing personalised and empathetic responses on the fly.
These results suggest that emotion-aware AI systems can enhance student engagement and learning while also helping to create more supportive, responsive, and emotionally intelligent educational contexts. By combining emotional intelligence with deep learning models, our system can transform the conventional one-size-fits-all style of education into a more personalised and adaptive one. In future work, we will refine the system for real-time application in larger educational settings and investigate additional integrations with adaptive learning technologies (e.g., intelligent tutoring systems or VR-based platforms).
Future directions
Although the emotion-aware AI system described here is promising, several critical aspects need to be improved:
Better recognition of subtle emotions: future research can explore improvements in classifying subtle emotions such as fear, contempt, and disgust. Specialised models and more extensive data augmentation might enable the model to pick up these delicate emotional cues more reliably.
Enhanced multimodal fusion: future work will investigate multimodal fusion, particularly with GNNs, to understand how modality-specific emotional cues relate to one another. Dynamic weighting and prioritisation of modalities based on context (e.g., stress or excitement) would make the system more adaptable.
Real-time optimisation: to enable real-time deployment, further research should investigate optimisation strategies such as pruning, quantisation, and edge processing. Balancing accuracy and latency will remain a challenge when implementing the system in mobile and embedded setups.
Cross-Cultural and Cross-Domain Validation: To establish the model’s applicability to various cultural groups and age ranges, cross-cultural testing of the model is critical. We hope that future research will test how well the model works on a larger variety of datasets, to guarantee it behaves as expected in different educational environments.
Ethical issues: as emotion data tends to be sensitive, the ethical issues relating to privacy, consent, and data security must be addressed. Deployments must also include a transparent user-consent mechanism that supports the ethical use of emotional data.
Long-term engagement effects: future longitudinal research is needed to investigate the long-term impact of emotion-aware systems on student engagement, learning behaviours, and emotional well-being, and to confirm that the positive effects are maintained over time.
Interfacing with other educational technologies: future work should concentrate on interfacing the emotion-aware system with other adaptive technologies, e.g., intelligent tutoring, gamification, or virtual reality applications, to provide more immersive, personalised learning environments.
Interpretability and explainable AI: in future work, SHAP (Shapley Additive exPlanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) will be integrated into our method for model interpretability. SHAP can describe the relative contribution that each input feature (e.g., facial expression, speech tone, and text) makes to the model's decisions, making the system's predictions more transparent and understandable for educators. This will be especially relevant for understanding how the different emotional cues (facial, speech, or text) influence the system. Such methods provide not only accurate predictions but also clear explanations of why the AI system takes certain actions, thereby increasing trust and confidence in its use.
Acknowledgements
The author extends their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-17).
Abbreviations
- AI
Artificial intelligence
- FER
Facial expression recognition
- NLP
Natural language processing
- CNN
Convolutional neural network
- LSTM
Long short-term memory
- ViT
Vision transformer
- GNN
Graph neural network
- SMOTE
Synthetic minority over-sampling technique
- Dlib
Digital library (face processing toolkit)
- IoT
Internet of Things
- EEG
Electroencephalogram
- GDPR
General data protection regulation
- IEMOCAP
Interactive emotional dyadic motion capture
- SoftMax
Soft maximum function
- EI
Emotional intelligence
- SER
Speech emotion recognition
- DL
Deep learning
- RNN
Recurrent neural network
- TCN
Temporal convolutional network
- BERT
Bidirectional encoder representations from transformers
- MFCC
Mel-frequency cepstral coefficients
- MTCNN
Multi-task cascaded convolutional neural network
- FC
Fully connected
- FL
Federated learning
- HCI
Human–computer interaction
- FERPA
Family educational rights and privacy act
- AffectNet
Facial emotion dataset
- MF
Multimodal fusion
Author contributions
Umesh Kumar Lilhore, Xiaoyu Wu conceptualised the research, designed the methodology, and contributed to data analysis and result interpretation. Tientien Lee was responsible for data collection and played a key role in experimental work while assisting in manuscript drafting and revision. Umesh Kumar Lilhore focused on statistical analysis, data visualisation, and contributed to writing the discussion section. Sarita Simaiya supported laboratory work, experimental processes, and manuscript editing. Roobaea Alroobaea contributed to the literature review and assisted with manuscript revisions. Abdullah M. Baqasah provided technical support during data collection, validated results, and contributed to the methodology. Majed Alsafyani helped with data analysis and interpretation and provided feedback on the manuscript. Finally, Lidia Gosy Tekeste, as the corresponding author, oversaw the project, coordinated the team, and wrote the final manuscript, ensuring the research was completed.
Funding
This research was funded by Taif University, Taif, Saudi Arabia, project number (TU-DSPP-2024-17).
Data availability
The dataset is available from the corresponding author upon individual request.
Declarations
Competing interests
The authors declare no competing interests.
Consent for publication
All authors have reviewed and approved the final manuscript for publication.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Tientien Lee, Email: lee.tt@fsmt.upsi.edu.my.
Umesh Kumar Lilhore, Email: umeshlilhore@gmail.com.
Lidia Gosy Tekeste, Email: lidiagosytekeste@gmail.com.
References
- 1. Singh, T. M., Reddy, C. K. K., Murthy, B. V. R., Nag, A. & Doss, S. AI and education: Bridging the gap to personalized, efficient, and accessible learning. In Internet of Behavior-Based Computational Intelligence for Smart Education Systems, 131–160 (IGI Global, 2025).
- 2. Roumpas, K., Fotopoulos, A. & Xenos, M. A framework for ethical, cognitive-aware human–AI interaction in multimodal adaptive learning systems. In Cognitive-Aware Human–AI Interaction in Multimodal Adaptive Learning Systems.
- 3. Islam, M. M., Nooruddin, S., Karray, F. & Muhammad, G. Enhanced multimodal emotion recognition in healthcare analytics: A deep learning-based model-level fusion approach. Biomed. Signal Process. Control 94, 106241 (2024).
- 4. Qiang, S. U. N. Deep learning-based modeling methods in personalized education. Artif. Intell. Educ. Stud. 1(1), 23–47 (2025).
- 5. Hadinezhad, S., Garg, S. & Lindgren, R. Enhancing inclusivity: Exploring AI applications for diverse learners. In Trust and Inclusion in AI-Mediated Education: Where Human Learning Meets Learning Machines, 163–182 (Springer, Cham, 2024).
- 6. Kumar, R., Kumar, P., Sobin, C. C. & Subheesh, N. P. Blockchain and AI in Shaping the Modern Education System (2025).
- 7. Lee, A. V. Y., Koh, E. & Looi, C. K. AI in education and learning analytics in Singapore: An overview of key projects and initiatives. Inf. Technol. Educ. Learn. 3(1), Inv-p001 (2023).
- 8. Zhou, X. et al. Personalized federated learning with model-contrastive learning for multi-modal user modeling in human-centric metaverse. IEEE J. Sel. Areas Commun. 42(4), 817–831 (2024).
- 9. Soman, G., Judy, M. V. & Abou, A. M. Human guided empathetic AI agent for mental health support leveraging reinforcement learning-enhanced retrieval-augmented generation. Cogn. Syst. Res. 90, 101337 (2025).
- 10. Xia, B., Innab, N., Kandasamy, V., Ahmadian, A. & Ferrara, M. Intelligent cardiovascular disease diagnosis using deep learning enhanced neural network with ant colony optimization. Sci. Rep. 14(1), 21777 (2024).
- 11. Lateef, M. Harnessing AI and machine learning to elevate educational wearable technology. In Wearable Devices and Smart Technology for Educational Teaching Assistance, 53–80 (IGI Global Scientific Publishing, 2025).
- 12. Rayudu, K. M., Chinnammal, V., Rubiston, M. M., Padmaloshani, P., Singaravelu, R. & Merlin, N. R. G. Experimental analysis of artificial intelligence powered adaptive learning methodology using enhanced deep learning principle. In 2024 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), 1–7 (IEEE, 2024).
- 13. Salloum, S. A., Alomari, K. M., Alfaisal, A. M., Aljanada, R. A. & Basiouni, A. Emotion recognition for enhanced learning: Using AI to detect students’ emotions and adjust teaching methods. Smart Learn. Environ. 12(1), 21 (2025).
- 14. Vistorte, A. O. R. et al. Integrating artificial intelligence to assess emotions in learning environments: A systematic literature review. Front. Psychol. 15, 1387089 (2024).
- 15. Liu, Y. et al. Sample-cohesive pose-aware contrastive facial representation learning. Int. J. Comput. Vis. 133(6), 3727–3745 (2025).
- 16. Zhang, X., Cheng, X. & Liu, H. TPRO-NET: An EEG-based emotion recognition method reflecting subtle changes in emotion. Sci. Rep. 14(1), 13491 (2024).
- 17. Meng, T. et al. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing 569, 127109 (2024).
- 18. Xie, Y., Yang, L., Zhang, M., Chen, S. & Li, J. A review of multimodal interaction in remote education: Technologies, applications, and challenges. Appl. Sci. 15(7), 3937 (2025).
- 19. Sangeetha, S. K. B., Immanuel, R. R., Mathivanan, S. K., Cho, J. & Easwaramoorthy, S. V. An empirical analysis of multimodal affective computing approaches for advancing emotional intelligence in artificial intelligence for healthcare. IEEE Access 12, 114416–114434 (2024).
- 20. Li, C., Weng, X., Li, Y. & Zhang, T. Multimodal learning engagement assessment system: An innovative approach to optimizing learning engagement. Int. J. Hum. Comput. Interact. 41(5), 3474–3490 (2025).
- 21. Khediri, N., Ben Ammar, M. & Kherallah, M. A real-time multimodal intelligent tutoring emotion recognition system (MITERS). Multimed. Tools Appl. 83(19), 57759–57783 (2024).
- 22. Sajja, R., Sermet, Y., Cikmaz, M., Cwiertny, D. & Demir, I. Artificial intelligence-enabled intelligent assistant for personalized and adaptive learning in higher education. Information 15(10), 596 (2024).
- 23. Chetry, K. K. Transforming education: How AI is revolutionizing the learning experience. Int. J. Res. Publ. Rev. 5(5), 6352–6356 (2024).
- 24. Zhang, X. et al. Smart classrooms: How sensors and AI are shaping educational paradigms. Sensors (Basel, Switzerland) 24(17), 5487 (2024).
- 25. Govea, J., Navarro, A. M., Sánchez-Viteri, S. & Villegas-Ch, W. Implementation of deep reinforcement learning models for emotion detection and personalization of learning in hybrid educational environments. Front. Artif. Intell. 7, 1458230 (2024).
- 26. Yadav, U. & Shrawankar, U. Artificial intelligence across industries: A comprehensive review with a focus on education. In AI Applications and Strategies in Teacher Education, 275–320 (2025).
- 27. Marques-Cobeta, N. Artificial intelligence in education: Unveiling opportunities and challenges. In Innovation and Technologies for the Digital Transformation of Education: European and Latin American Perspectives, 33–42 (2024).
- 28. Gan, W., Dao, M. S., Zettsu, K. & Sun, Y. IoT-based multimodal analysis for smart education: Current status, challenges, and opportunities. In Proceedings of the 3rd ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, 32–40 (2022).
- 29. Zhou, X., Xuesong, Xu., Liang, W., Zeng, Z. & Yan, Z. Deep-learning-enhanced multitarget detection for end–edge–cloud surveillance in smart IoT. IEEE Internet Things J. 8(16), 12588–12596 (2021).
- 30. Halkiopoulos, C. & Gkintoni, E. Leveraging AI in e-learning: Personalized learning and adaptive assessment through cognitive neuropsychology—A systematic analysis. Electronics 13(18), 3762 (2024).
- 31. Duan, S., Wang, Z., Wang, S., Chen, M. & Zhang, R. Emotion-aware interaction design in intelligent user interface using multi-modal deep learning. In 2024 5th International Symposium on Computer Engineering and Intelligent Communications (ISCEIC), 110–114 (IEEE, 2024).
- 32. Sharma, K., Papamitsiou, Z. & Giannakos, M. Building pipelines for educational data using AI and multimodal analytics: A “grey-box” approach. Br. J. Edu. Technol. 50(6), 3004–3031 (2019).
- 33. Villegas-Ch, W., Gutierrez, R. & Mera-Navarrete, A. Multimodal emotional detection system for virtual educational environments: Integration into Microsoft Teams to improve student engagement. IEEE Access 13, 42910–42933 (2025).
- 34. Li, Y., Chai, Z., You, S., Ye, G. & Liu, Q. Student portraits and their applications in personalized learning: Theoretical foundations and practical exploration. Front. Digit. Educ. 2(2), 1–17 (2025).
- 35. Javed, S., Ezehra, S. R., Ullah, H. & Naveed, M. How AI can detect emotional cues in students, improving virtual learning environments by providing personalized support and enhancing social-emotional learning. Rev. Appl. Manag. Soc. Sci. 8(2), 665–682 (2025).
- 36. Zong, Y. & Yang, L. How AI-enhanced social–emotional learning framework transforms EFL students’ engagement and emotional well-being. Eur. J. Educ. 60(1), e12925 (2025).
- 37. Thirunagalingam, A. & Whig, P. Emotional AI integrating human feelings in machine learning. In Humanizing Technology With Emotional Intelligence, 19–32 (IGI Global Scientific Publishing, 2025).
- 38. Annapareddy, V. N., Singireddy, J., Nanan, B. P. & Burugulla, J. K. R. Emotional Intelligence in Artificial Agents: Leveraging Deep Multimodal Big Data for Contextual Social Interaction and Adaptive Behavioral Modelling (2025).
- 39. Zhang, F., Wang, X. & Zhang, X. Applications of deep learning method of artificial intelligence in education. Educ. Inf. Technol. 30(2), 1563–1587 (2025).
- 40. Kolhatin, A. O. From automation to augmentation: A human-centered framework for generative AI in adaptive educational content creation. In CEUR Workshop Proceedings, 143–195 (2025).
- 41. Sajja, R., Sermet, Y., Cwiertny, D. & Demir, I. Integrating AI and learning analytics for data-driven pedagogical decisions and personalized interventions in education (2023). https://arxiv.org/abs/2312.09548.
- 42. Parkavi, R., Karthikeyan, P. & Abdullah, A. S. Enhancing personalized learning with explainable AI: A chaotic particle swarm optimization-based decision support system. Appl. Soft Comput. 156, 111451 (2024).
- 43. Cheng, S., Liu, Q., Chen, E., Huang, Z., Huang, Z., Chen, Y., Ma, H. & Hu, G. DIRT: Deep learning enhanced item response theory for cognitive diagnosis. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2397–2400 (2019).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The dataset is available from the corresponding author upon individual request.