Scientific Reports. 2025 Dec 3;16:105. doi: 10.1038/s41598-025-29073-4

Empowering emotional intelligence through deep learning techniques

B V Gokulnath 1, Pampana Charmitha 1, Pampana Chathurya 1, D Lavanya Satya Sri 1, Baratam Vennela 1, S P Siddique Ibrahim 1, S Selva Kumar 1
PMCID: PMC12764474  PMID: 41339679

Abstract

We propose that employing an ensemble of deep learning models can enhance the recognition and adaptive response to human emotions, outperforming the use of a single model. Our study introduces a multimodal emotional intelligence system that blends CNNs for facial emotion detection, BERT for text mood analysis, RNNs for tracking emotions over time, and GANs for creating emotion-specific content. We built these models with TensorFlow, Keras, and PyTorch, and trained them on Kaggle datasets, including FER-2013 for facial expressions and labeled text data for sentiment tasks. Our experiments show strong results: CNNs reach about 80% accuracy in recognizing facial emotions, BERT achieves about 92% accuracy in text sentiment, RNNs reach around 89% for sequential emotion tracking, and GANs produce personalized, age-appropriate content that is judged contextually appropriate in over 90% of test cases. These findings support the idea that a combined model architecture can yield more accurate and adaptable emotional responses than simpler approaches. The framework could be useful in areas such as healthcare, customer service, education, and digital well-being, helping to create AI systems that are more empathetic and user-focused.

Keywords: Emotional intelligence, Deep learning, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Bidirectional Encoder Representations from Transformers (BERT), Generative Adversarial Networks (GANs), Facial emotion recognition, Sentiment analysis, Multimodal emotion recognition, Human–computer interaction

Subject terms: Medical imaging, Engineering

Introduction

Today, computer systems that try to understand feelings often cannot recognize or respond well to emotions expressed by people of different ages or with different ways of communicating. These systems usually do not adjust to each person or situation, so they cannot give good emotional help or connect in a meaningful way. This project aims to address this problem by creating a system that can understand feelings from both facial expressions and text and respond with personalized, age-appropriate content to improve the user's well-being. The system is designed to be caring, aware of the situation, and able to adapt as the user's feelings change over time.

The goal of this work is to design an emotional intelligence system that identifies and reacts to human emotions using deep learning models. By recognizing emotional states, the system aims to offer personalized content to improve the user's emotional well-being. Beyond identifying emotions, the system also provides age-appropriate customized content: for children it provides anime-style images, for adults it generates age-based poems, and for the elderly it suggests books. To achieve this, the system uses several deep learning models. EfficientNet introduces a compound scaling method to optimize convolutional neural networks (CNNs). By simultaneously adjusting depth, width, and resolution, this approach enhances performance while reducing computational demands. It demonstrates superiority over models like ResNet and Inception in terms of both accuracy and efficiency, highlighting the critical role of balancing scaling factors in advancing model design1.

In this project, facial expression recognition is implemented using Convolutional Neural Networks (CNNs), a type of deep learning model designed specifically for processing and analyzing visual data. CNNs operate by applying layers of filters to images, enabling them to identify complex patterns and distinctive features, which makes them particularly suited to recognizing facial expressions. By leveraging this method, the system can effectively detect emotions such as happiness, sadness, and surprise by analyzing subtle variations in facial features and expressions. The system also integrates BERT, a powerful model widely used in natural language processing (NLP), to complement the image-based emotion recognition. What sets BERT apart is its ability to process text in both directions, examining the words before and after each target word. This bidirectional method enables the model to understand the full context of each word, unlike traditional models that process text in a single direction, and allows BERT to capture subtle variations in meaning and sentiment, making it particularly appropriate for sentiment analysis tasks. The system can therefore properly understand the emotional tone of the user's text and respond in a manner that aligns with their mood, creating a more empathetic and context-sensitive interaction.

The system leverages Recurrent Neural Networks (RNNs) alongside BERT to better interpret and respond to the user's emotional state. RNNs are well suited to sequential data, such as text or speech, because they rely on earlier inputs to interpret the current context. Their design includes a feedback mechanism that allows them to retain knowledge from previous steps and adjust their processing accordingly, producing more context-aware responses. For emotional analysis, RNNs analyze sequences of words in user input, helping the system detect emotional shifts or patterns across a conversation. This makes RNNs especially useful for tracking the progression of a user's emotional state over time: by processing text sequentially, they can identify the emotional tone not just of isolated sentences but of the broader dialogue, enabling the system to adapt to the user's emotional journey throughout the interaction. Finally, the system uses Generative Adversarial Networks (GANs), a framework in which two neural networks operate against each other: a discriminator and a generator.
The discriminator determines if the input is authentic or synthetic, while the generator synthesizes data meant to resemble real-world examples. Both networks improve their performance over time as a result of this competitive process, producing data that is incredibly realistic. The development of the generative modeling technique has been greatly aided by GANs' remarkable capacity to produce synthetic pictures and other types of data.

The rest of this paper is organized as follows: Section "Literature survey" reviews related work on deep learning for emotion recognition. Section "Methods" explains our approach, including the models and how their predictions are evaluated. Section "Methodology" details the datasets, preprocessing, and model training. Section "Results" presents the experimental results. Section "Experimentation tools" lists the software we used, Section "Limitations" describes the main limitations of the study, Section "Conclusion" summarizes our findings and their implications, and Section "Future Scope" suggests directions for future research.

Literature survey

Models in deep learning, including CNNs, RNNs, and bidirectional transformers such as BERT, are playing a key role in advancing the field of emotional intelligence research. CNNs excel at image-related tasks, making them well suited to recognizing emotions from facial expressions. RNNs, particularly long short-term memory (LSTM) networks, specialize in evaluating sequential data such as text and detecting emotional patterns over time. Bidirectional models, such as BERT, enhance text-based emotion recognition by examining the context surrounding each word in a sequence, leading to more accurate interpretations of emotional nuance. One example is a CNN-bidirectional LSTM hybrid model that set a standard for facial emotion identification by obtaining state-of-the-art accuracy in facial expression detection on the CK+ dataset. In the same direction, a CNN-bidirectional LSTM model was proposed that outperformed alternative techniques in facial emotion categorization; by adding a text extraction technique, it increased the CNN's capacity for emotion identification and text data analysis, although this came with added preprocessing requirements. Bidirectional LSTMs and RNNs, which can analyze information from both past and future contexts, have shown strong potential in sentiment analysis and emotion recognition within natural language processing (NLP). The importance of CNN-LSTM networks in emotion recognition from face data has also been highlighted, making use of the global spatial dependency of facial expressions for more precise categorization. Although face-based emotion identification has been studied extensively, only a few studies have combined CNN and RNN architectures for facial expression detection with BERT for nuanced emotion extraction from textual input. By using CNNs for visual emotion recognition, RNNs for sequential text analysis, and BERT for contextualized emotion extraction from textual interactions, our team seeks to close this gap. With wide-ranging applications in customer service, mental health, and healthcare, this method may increase the precision and responsiveness of emotionally intelligent systems.

EfficientNet, proposed by Tan and Le1, offers a new method for scaling the capacity of convolutional neural networks (CNNs), improving performance with fewer parameters. This method employs a compound scaling technique that uniformly adjusts depth, width, and resolution. By employing this technique, EfficientNet achieves cutting-edge results on various benchmarks, offering improved accuracy with reduced computational demands. This advancement has become pivotal in optimizing deep learning models for computer vision tasks.1

Recursive Deep Learning for Sentiment Analysis: Socher et al.2 explored the use of recursive neural networks for sentiment analysis, which improves the understanding of complex linguistic structures by modeling hierarchical sentence structure. Their work has helped construct more robust sentiment analysis models by using a semi-supervised technique for training deep models that combines a limited quantity of labeled data with a larger pool of unlabeled data.2

Multimodal Sentiment Analysis for Social Media during Emergencies: Poria et al.3,4 introduced a multimodal method for sentiment analysis that combines textual and visual data from social media in the context of public emergencies. They focused on understanding emotional reactions during crises, highlighting the importance of incorporating multimodal data for more accurate emotion and sentiment detection in complex, real-world environments.3

Deep Convolutional Networks for Emotion Recognition in Human-Robot Interaction: Zhang et al.5 presented an improved convolutional neural network (CNN) for emotion detection in human-robot interaction systems. They demonstrated how CNNs could be optimized for emotion detection, enabling robots to interpret human emotions more effectively. Their approach aimed at enhancing human-robot collaboration by making the interaction more intuitive and empathetic. Generative adversarial networks (GANs) revolutionized unsupervised learning through a novel framework that uses two neural networks, a generator and a discriminator, working in opposition to each other. Through this adversarial process, both networks enhance their performance, with the generator eventually creating data that closely resembles real-world examples. GANs have significantly influenced the creation of synthetic data and opened new possibilities in generating realistic images, audio, and videos.5

AffectNet Database for Emotion Recognition: Mollahosseini et al.6 developed AffectNet, a large-scale dataset focused on facial expressions that supports the training of emotion recognition systems. It contains images labeled with facial expressions, valence, and arousal, aiding in the creation of effective emotion recognition models. This dataset has become a key resource for training CNN-based systems aimed at real-world emotion detection tasks.7

BERT: Devlin et al.8 introduced BERT (Bidirectional Encoder Representations from Transformers), a pre-trained model aimed at improving various NLP tasks. By utilizing a bidirectional attention mechanism, BERT collects contextual information from both the preceding and the following text, significantly enhancing its language comprehension. This innovation has raised the bar for tasks such as sentiment analysis and question answering, outperforming earlier models.6

He et al.9 pioneered deep residual learning, a technique that allows the effective training of very deep networks by addressing the vanishing gradient issue. Residual networks (ResNets) have become a key architecture in computer vision, enabling the training of networks with hundreds or even thousands of layers, which leads to significant improvements in image recognition accuracy.8

EfficientNet: Revisiting Model Scaling: Tan and Le1 revisited their earlier work on EfficientNet, refining their model scaling strategy. They showed that by thoughtfully adjusting the depth, width, and resolution of a model, it is possible to create more efficient networks that deliver improved performance while minimizing the number of parameters, balancing computational expense and accuracy.9

Sequence to Sequence Learning with Neural Networks: Sutskever et al.10 introduced the sequence-to-sequence (seq2seq) model, a revolutionary approach for tasks involving input-output sequences, such as machine translation. Their model employs two recurrent neural networks (RNNs) to transform input sequences into output sequences, facilitating progress in language processing tasks and expanding the use of deep learning in natural language understanding.1

EmoDet2: Combining Neural Networks for Emotion Detection: The paper introduces a method for extracting text lines from handwritten documents using distance transforms. It tackles challenges like diverse handwriting styles and orientations by grouping connected components into structured text lines. The approach involves preprocessing to reduce noise, clustering components, and refining results, improving the accuracy of handwriting recognition systems.11 BERT-CNN for Emotion Detection presents a hybrid model that combines BERT, which excels in capturing contextual information, with Convolutional Neural Networks (CNNs), which are adept at identifying spatial patterns. The model is designed to enhance emotion detection in textual data by leveraging BERT's ability to understand word relationships and CNN's efficiency in feature extraction. This approach improves the model's performance in recognizing subtle emotional cues that are often embedded in complex language structures.10

Speech Emotion Recognition Using Deep Neural Networks focuses on employing deep neural networks (DNNs) for speech emotion recognition, aiming to automatically classify emotional states from audio signals. The paper emphasizes the ability of DNNs to learn intricate, non-linear patterns from raw audio data, which leads to better performance in detecting subtle emotional variations. This approach is shown to improve the robustness and accuracy of emotion recognition, making it applicable to various real-world applications like virtual assistants and customer service.12

Hierarchical Contextual Emotion Detection focuses on understanding emotions within hierarchical contexts, such as conversations, where the emotional tone can change depending on the preceding and following dialogue. The approach takes into account the broader context of the conversation, which helps the model better capture shifts in emotional states. By using hierarchical processing, the model achieves more accurate predictions in scenarios where emotions are influenced by the discourse flow, rather than isolated statements.13

Text Classification with BERT and Attention Mechanisms combines BERT’s pre-trained language model with attention mechanisms to improve text classification tasks, specifically focusing on emotion and sentiment analysis. The attention mechanism enables the model to focus on important parts of the text, such as emotionally significant words, allowing it to better understand the context and nuances. The approach leads to improved classification accuracy, especially for tasks involving complex emotional expressions and sentiments.14

Deep Learning for Sentiment and Emotion Analysis provides a review of deep learning methods for sentiment and emotion analysis; the paper discusses various architectures, including CNNs, RNNs, and transformers, and their effectiveness in handling the complexities of natural language. The authors explore how each model contributes to better understanding emotions in text, highlighting the role of deep learning in processing large datasets and capturing complex relationships between words that signify sentiment and emotional tone.15

Implicit Emotion Detection Using Attention Mechanisms addresses the challenge of detecting implicit emotions in text, which are often conveyed subtly and not directly stated. By using attention mechanisms, the model can focus on key phrases and cues that suggest hidden emotional states. This method allows the model to detect nuanced emotions in text that traditional approaches may miss, improving overall detection accuracy in sentiment analysis tasks.16

In Deep Learning Techniques for Multimodal Emotion Recognition, the authors explore multimodal emotion recognition, which involves integrating text, audio, and visual data using deep learning techniques. By combining these different types of data, the model can leverage complementary features that improve emotion detection.17

The research emphasizes the importance of feature fusion, where the model learns to combine data from multiple modalities to get a more comprehensive understanding of the emotional context. GANs for Synthetic Emotional Image Generation investigates the use of Generative Adversarial Networks (GANs) for generating synthetic images that express specific emotions. These synthetic images are valuable for enhancing datasets used in emotion recognition systems, as they provide diverse and controlled examples of emotional expressions.18

The research demonstrates how GANs can create realistic images that reflect a wide range of emotional states, which can then be used to train more robust emotion recognition models.19

Transformer-based Emotion Detection in Social Media applies transformer-based models to detect emotions in social media texts, where informal language, slang, and abbreviations are commonly used. Transformers, particularly BERT, are effective in handling these challenges due to their ability to understand contextual relationships between words, regardless of the text's informal structure. The paper focuses on how transformer models can be scaled to handle large social media datasets and accurately detect emotions in posts, tweets, and comments.20

This paper introduces a combined model (CNN + Bi-LSTM + Attention) to recognize emotions from EEG signals and achieves very high accuracy (99.79%). By mixing CNN (for spatial patterns) with Bi-LSTM and attention (for timing and important parts), the method captures both where and when useful signals occur, giving strong and reliable results.21

The approach combines GANs with BERT to improve text classification, especially when there isn’t much labeled data. By using GANs, it needs fewer labeled examples and helps the model generalize better to new data.22

This method mixes BERT and CNN so it can learn both the overall context (from BERT) and local word patterns (from CNN) for emotion detection. It reaches a high accuracy of 94.7%. A 3D-CNN with attention and an RNN is used to recognize emotions from different sources like video and audio23. A framework that uses sound (audio), pictures (video), and writing (text) features to better detect emotions with deep learning24. Uses speech and video data from the IEMOCAP dataset and runs neural networks to classify emotions in real time25. Adapts BERT to recognize emotions in many languages, especially where there isn’t much labeled data26. Explores using GANs and VAEs to create data that shows emotions, to help improve how we classify emotions27. Puts together different deep learning models to make a group (ensemble) that detects emotions more accurately and reliably28. Uses NLP with deep learning to find emotions in healthcare-related text, helping make decisions that consider feelings29. Introduces the Aff-Wild dataset and several deep learning models for predicting real emotions using video data30. Focuses on learning from raw sound and pictures all in one go to recognize emotions across different senses (multimodal)31.

Uses attention methods with convolutional networks to improve facial emotion recognition by focusing on important features4. Reviews methods and models for recognizing facial expressions, focusing on CNN-based deep learning32. Uses speech features and their written transcripts to build a two-input system that detects emotions more accurately33. Highlights why choosing strong audio features is important for recognizing emotions in speech with deep models34. Provides a thorough review of methods that combine computer vision (CV) and deep learning (DL) to detect emotions from faces and body language35. Proposes a new model that uses the surrounding scene context plus facial expressions to classify emotions36.

Summarizes different CNN and RNN models for facial emotion detection, showing recent trends and challenges in an accessible way37. A ready-made list of deep learning models for speech emotion recognition38, with improvements using attention39.

Methods

In this project, we picked different models because each is well suited to a different kind of data needed to recognize emotions. CNNs are best for analyzing facial images because they can notice important details in pictures. We use BERT to understand emotions in text since it pays attention to the meaning of words in context. RNNs are helpful for finding changes in emotions over time in sequences, such as sentences. Finally, we use GANs to create emotion-specific content, customizing output for different age groups: anime images for children, poems for adults, and book recommendations for elders. By using these models together, we improve our ability to recognize emotions from different types of data. Our approach combines CNNs for facial emotion recognition, BERT for sentiment interpretation, GANs for poetry generation, and RNNs for tracking emotional trends. Each model is tuned to enhance system responsiveness and accuracy, working together to provide real-time, adaptive emotional support based on user input. This integrated framework ensures that users receive tailored content, promoting emotional awareness and engagement. Figure 1 shows the whole system design and how each part of the model works together in the emotional intelligence framework.
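As an illustration only, the sketch below shows how the four components could be wired together at inference time. The wrapper names (detect_face_emotion, detect_text_emotion, track_emotion_history, generate_content) are hypothetical placeholders around the trained CNN, BERT, RNN, and GAN models, and the simple fusion rule is an assumption rather than the exact logic used in our implementation.

```python
# Minimal sketch of the multimodal routing logic. The wrapper functions are
# hypothetical placeholders, not the project's actual code.

def respond_to_user(face_image, text, history, age_group,
                    detect_face_emotion, detect_text_emotion,
                    track_emotion_history, generate_content):
    """Fuse facial and textual emotion estimates, track them over time,
    and return age-appropriate generated content."""
    face_emotion = detect_face_emotion(face_image)   # CNN prediction
    text_emotion = detect_text_emotion(text)         # BERT prediction
    # Simple fusion rule (assumed): prefer agreement, otherwise trust the text.
    current = face_emotion if face_emotion == text_emotion else text_emotion
    trend = track_emotion_history(history + [current])  # RNN over the sequence
    # Age-keyed content generation backed by the GAN / recommendation modules.
    if age_group == "child":
        return generate_content("anime_image", trend)
    if age_group == "adult":
        return generate_content("poem", trend)
    return generate_content("book_recommendation", trend)
```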

Fig. 1.

Fig. 1

Flowchart of the emotional intelligence system that combines CNN, BERT, RNN, and GAN models to create personalized emotional responses using different types of data.

Prediction and performance calculation

To evaluate the system's effectiveness in recognizing and responding to emotional cues, we implemented a rigorous prediction-performance evaluation. Using real-time input data, our model predicts user emotions and generates tailored responses with high accuracy. We measured each model's performance using metrics such as accuracy, precision, recall, and F1 score, assessing each component's ability to classify emotions accurately and generate relevant content. This evaluation framework allows us to continuously refine the system's predictive capabilities and ensure that its responses effectively support users' emotional needs.
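As a concrete illustration of these metrics, the snippet below computes accuracy and macro-averaged precision, recall, and F1 for a set of predicted emotion labels. The use of scikit-learn here is an assumption about tooling (the paper does not name the metrics library), and the label values are made-up examples.

```python
# Sketch: computing accuracy, precision, recall, and F1 for emotion predictions.
# scikit-learn is assumed here; the label values are illustrative only.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["happy", "sad", "angry", "happy", "neutral"]   # ground-truth labels
y_pred = ["happy", "sad", "happy", "happy", "neutral"]   # model predictions

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every emotion class equally, which matters when classes are imbalanced.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```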

Methodology

Data gathering

This project gathers two types of data: text and images. The text dataset contains sentences labeled with different feelings (such as happy or sad) in CSV files, which we used to train and test our BERT and RNN models. The image dataset includes pictures labeled with emotions such as happy, sad, and surprised, organized into folders for training our CNN and GAN models. Additionally, we created age-specific datasets, including anime images for children, emotion-based poems for adults, and book metadata for elders. We divided the data into three parts (training, validation, and testing), with 70% used for training.

  • Text Data: We used CSV files (test.csv, training.csv, and validation.csv) containing two columns, text and label. The text column holds sentences expressing different emotions, and the label column gives the corresponding emotion value for each sentence. These datasets are used for training and evaluating the RNN and BERT models.

  • Image Data: The image data contains two folders, train and test; each folder has sub-folders for the different emotions (happy, sad, neutral, surprised, fearful, disgusted, and angry), and each sub-folder holds the corresponding images, which are used for training the CNN and GAN models.

  • Age-Specific Datasets: These datasets cover the different age groups: for children we collected an anime dataset, for adults a poems dataset, and for the elderly a book recommendation dataset.

Data preprocessing

  • Text Preprocessing: For the RNN model, we combined training.csv, test.csv, and validation.csv, performed tokenization with a tokenizer that limits the vocabulary, and used a LabelEncoder to convert emotions into numeric values (a minimal sketch follows after this list). For the BERT model, label encoding converts the emotion labels into numeric values; the dataset is split into training and testing sets, and the labels are converted into TensorFlow tensors for model training.

  • Image Preprocessing: For the CNN model, the dataset is preprocessed by resizing the images, converting them to grayscale, normalizing pixel values, and applying data augmentation techniques such as rotation, flipping, and zooming to help reduce overfitting; one-hot encoding is also used for more efficient training (see the augmentation sketch after this list). For the GAN model, the dataset includes MNIST images of handwritten digits, primarily used for training and evaluating model performance. These grayscale images are a standard benchmark for assessing the model's effectiveness.
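The following is a minimal sketch of the text preprocessing described above, using Keras' Tokenizer and scikit-learn's LabelEncoder. The column names follow the dataset description; the vocabulary size, sequence length, and split ratio shown here are illustrative assumptions.

```python
# Sketch of the text preprocessing pipeline (vocabulary size, sequence length,
# and split ratio are illustrative assumptions, not the exact settings used).
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Combine the three CSV splits, as described for the RNN model.
df = pd.concat([pd.read_csv(f) for f in
                ("training.csv", "test.csv", "validation.csv")])

tokenizer = Tokenizer(num_words=10000)        # cap the vocabulary size
tokenizer.fit_on_texts(df["text"])
sequences = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=50)

labels = LabelEncoder().fit_transform(df["label"])   # emotion names -> integers

X_train, X_test, y_train, y_test = train_test_split(
    sequences, labels, test_size=0.3, random_state=42)
```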
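For the image side, a comparable sketch of the augmentation pipeline with Keras' ImageDataGenerator is shown below; the rotation and zoom ranges are assumptions, while the 48x48 grayscale target size matches the FER-2013 format.

```python
# Sketch of image preprocessing and augmentation for the CNN
# (rotation/flip/zoom parameters are illustrative assumptions).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=15,
    horizontal_flip=True,
    zoom_range=0.1,
)

train_data = train_gen.flow_from_directory(
    "train",                   # one sub-folder per emotion class
    target_size=(48, 48),      # FER-2013 images are 48x48
    color_mode="grayscale",
    class_mode="categorical",  # one-hot encoded labels
    batch_size=32,
)
```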

Model training

  • CNN for Emotion Recognition: The CNN model was trained on tagged facial photos, which helped it to recognize patterns and features associated with various emotions. To improve recognition accuracy, we tuned hyperparameters such as learning rate and batch size. The network has three convolutional layers that use ReLU (a simple function that keeps positive signals), followed by pooling to shrink the data and two dense (fully connected) layers that make the final decision. It was trained in batches of 32 images, and the full training set was run through the model for 10 epochs (a minimal architecture sketch appears after this list).

  • BERT for Sentiment Analysis: The BERT model was adapted using a sentiment-labeled dataset, enabling it to recognize subtle emotional nuances in user input. This adaptation increases the model's capacity to understand complex emotional expressions in language. We took a pre-trained BERT base model and adjusted it for our task by adding a softmax classifier on top. We fine-tuned the whole model with a small learning rate of 0.00002, processing 16 examples at a time, and repeated over the data for 4 epochs (see the fine-tuning sketch after this list).

  • GAN for Poetry Generation: Trained on emotion-specific poetry, the GAN model learns to generate poetry that aligns with the user’s detected emotions. This customization ensures the generated poetry resonates with the user’s emotional state, creating a personalized experience. The system has two parts: a generator that uses LSTM (a type of recurrent neural network) and a discriminator made of dense (fully connected) layers. It was trained for 50 full passes through the data, using binary cross-entropy to measure error, and it looked at 64 examples at a time during training.

  • RNN for Temporal Emotion Tracking: Sequential data from user interactions was used to train RNNs, which analyze changes in emotion over time. This model adapts to mood shifts, enabling dynamic responses that align with evolving user emotions. The model has two LSTM layers and ends with a softmax output that gives probabilities. It uses word embeddings of size 128, is trained on 32 examples at a time, and the whole dataset is passed through the model 10 times (a minimal sketch appears after this list).
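The listing below is a minimal Keras sketch of the CNN described above (three convolutional layers with ReLU, pooling, and two dense layers, trained with batch size 32 for 10 epochs). The filter counts and dense-layer width are assumptions, since the paper does not specify them.

```python
# Sketch of the facial-emotion CNN: 3 conv layers + pooling + 2 dense layers.
# Filter counts and dense width are illustrative assumptions.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(48, 48, 1)),           # grayscale FER-2013 frames
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(7, activation="softmax"),     # 7 emotion classes
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy",
            metrics=["accuracy"])
# cnn.fit(train_data, epochs=10)               # batch size 32 comes from the data generator
```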
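Similarly, a sketch of the BERT fine-tuning setup with the Hugging Face Transformers library is given below, matching the stated hyperparameters (learning rate 0.00002, batch size 16, 4 epochs). The checkpoint name bert-base-uncased and the number of emotion labels are assumptions.

```python
# Sketch of BERT fine-tuning for text sentiment. Checkpoint name and label
# count are assumptions; learning rate, batch size, and epochs follow the text.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)

def encode(texts):
    # Tokenize and pad a list of sentences into TensorFlow tensors.
    return tokenizer(texts, padding=True, truncation=True,
                     max_length=64, return_tensors="tf")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# enc = encode(train_texts)
# model.fit(dict(enc), tf.constant(train_labels), batch_size=16, epochs=4)
```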
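Finally, a sketch of the temporal emotion-tracking RNN (two LSTM layers, 128-dimensional embeddings, and a softmax output) is shown below; the vocabulary size, sequence length, and LSTM widths are illustrative assumptions.

```python
# Sketch of the temporal-emotion RNN: embedding + two LSTM layers + softmax.
# Vocabulary size, sequence length, and LSTM widths are illustrative.
from tensorflow.keras import layers, models

rnn = models.Sequential([
    layers.Input(shape=(50,)),                        # tokenized sequence length (assumed)
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(6, activation="softmax"),            # per-emotion probabilities
])
rnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# rnn.fit(X_train, y_train, batch_size=32, epochs=10)
```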

Result: Figure 2 shows the performance metrics (F1 score, recall, and precision) across the different classes.

Fig. 2.

Fig. 2

Overall comparison of model performance metrics

During the testing phase, we made a few simplifying assumptions to keep the experiments tractable. First, we assumed that the Kaggle datasets used (FER-2013 for facial expressions and labeled text data for sentiment) are representative enough to train models for real-world emotion detection. Second, we assumed emotions are expressed in a culturally neutral way, even though there can be differences. Third, when creating age-specific outputs, we assumed that anime tends to appeal to children, poetry to adults, and books to older people, though these ideas may not apply to everyone. Finally, we assumed that the training and test splits have a balanced number of examples per emotion and that users will provide facial or text input in a consistent manner during interactions.

Performance evaluation

  • Accuracy: We measured accuracy as the proportion of correctly predicted emotions across all samples, delivering an overview of each model’s reliability in emotion recognition and response production.

  • Precision and Recall: Precision quantifies the proportion of positive emotion predictions that are correct, showing the model's capacity to limit false positives. Recall measures how sensitive the model is to recognizing every instance of a certain emotion, indicating how reliably it identifies key emotions.

  • F1 Score: To achieve a balance between precision and recall, we used the F1 score, which offers a more complete assessment of the model’s performance, particularly in scenarios with class imbalance.

  • User Response Relevance: To ensure the generated poetry aligns meaningfully with user emotions, we assessed the relevance and emotional resonance of responses. Figure 3 shows the model's accuracy and loss during training, which helps us understand how well it is learning and adjusting to changing emotions.

Fig. 3.

Fig. 3

Model accuracy and model loss

Results

Training and Loss Calculation: Each model—CNN, BERT, RNN, and GAN—was trained on respective datasets tailored to their tasks. Metrics like training and validation accuracy as well as loss curves were analyzed to assess performance. The CNN and GAN models showed effective learning through gradually decreasing loss, while the BERT model achieved high precision in sentiment analysis. For GANs, loss analysis confirmed convergence between generator and discriminator. Model Evaluation: The system was evaluated using metrics such as accuracy, precision, recall, and F1 scores.

The evaluation highlighted BERT’s superior ability to handle nuanced textual emotional expressions. Confusion matrices and ROC curves were used for detailed assessments of classification quality, emphasizing the models’ robustness in emotion recognition. We first looked at the testing accuracy of our deep learning models, shown in Figure 4. To better understand how well the models classify emotions, we created a classification report and a confusion matrix, which shows how often the model predicted each emotion correctly (Figure 5). Figure 6 shows the ROC curve, which compares how well the model detects true positives against false positives at different settings.
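For readers who want to reproduce this kind of assessment, the snippet below sketches how a confusion matrix and per-class ROC curves (with AUC) can be produced with scikit-learn and Matplotlib. The tooling choice and the placeholder labels and scores are assumptions; the figures in this paper come from the trained models' actual predictions.

```python
# Sketch: confusion matrix and one-vs-rest ROC/AUC per emotion class.
# scikit-learn/Matplotlib are assumed tools; labels and scores are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = ["angry", "happy", "neutral", "sad"]       # illustrative subset
y_true = np.array([1, 2, 0, 3, 1])                   # true class indices
y_score = np.random.rand(len(y_true), len(classes))  # placeholder probabilities

print(confusion_matrix(y_true, y_score.argmax(axis=1)))

# One ROC curve per emotion (one-vs-rest), then the average AUC.
y_bin = label_binarize(y_true, classes=list(range(len(classes))))
aucs = []
for i, name in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_score[:, i])
    aucs.append(auc(fpr, tpr))
    plt.plot(fpr, tpr, label=f"{name} (AUC={aucs[-1]:.2f})")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
print("average AUC:", round(float(np.mean(aucs)), 2))
```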

Fig. 4.

Fig. 4

Shows the testing accuracy, giving a clear picture of how well the model did on the test data.

Fig. 5.

Fig. 5

Shows the classification report and confusion matrix, giving a detailed look at the model's precision, recall, and how well it classifies data. The CNN model most reliably classifies "happy" and "neutral" emotions, with fewer misclassifications than "fearful" or "disgusted," according to the confusion matrix (Figure 5). Misclassifications typically happen between visually similar emotions, such as "fearful" and "surprised."

Fig. 6.

Fig. 6

Shows the ROC curve, which explains how the model balances correctly detecting positives and avoiding false alarms at different settings. The curves show an average AUC of 0.91 for CNN, 0.95 for BERT, and 0.93 for the RNN. This performance is indirectly supported by the GAN model, which generates realistic, emotionally consistent augmentation outputs.

We also compared different algorithms—RNN, BERT, CNN, and GANs—side by side using their performance scores, as seen in Figure 7. Besides the numbers, we looked at how well the models create responses for different age groups. For example, Figure 8 shows the model's output for a happy elderly person, Figure 9 shows the output for a happy child, and Figure 10 shows the response for an adult feeling fear. Table 1 presents ROC-AUC values for the different models, and Table 2 compares the proposed model with prior research.

Fig. 7.

Fig. 7

Compares the accuracy of all models, clearly showing how they perform compared to each other.

Fig. 8.

Fig. 8

Emotion-based output for elderly (Happy Emotion)

Fig. 9.

Fig. 9

Emotion-based output for child (Happy Emotion)

Fig. 10.

Fig. 10

Emotion-based output for adult (Fear Emotion)

Table 1.

ROC–AUC values for different models

Model AUC (Happy) AUC (Sad) AUC (Angry) AUC (Surprise) Average AUC
CNN 0.88 0.85 0.90 0.86 0.87
BERT 0.94 0.91 0.95 0.92 0.93
RNN 0.90 0.88 0.89 0.89 0.89
GAN 0.91 0.90 0.92 0.89 0.90

Table 2.

Comparison of proposed model with prior research

Model Dataset Used Accuracy (%) F1-Score
CNN + LSTM CK+ 84 0.82
BERT (Devlin et al.8) IMDB 90 0.88
CNN + BiLSTM15 FER-2013 87 0.85
Proposed Model (CNN + BERT + RNN + GAN) FER-2013 + Text Dataset 92 0.90

Finally, Figure 11 summarizes the testing accuracies: CNN got 80%, BERT 92%, RNN 89%, and GANs 90%.

Fig. 11.

Fig. 11

Accuracy table

Comparative analysis of hybrid and simpler models

We tested how well the combined models work together compared to simpler, single-model options. The CNN did well on facial emotions, BERT did well on text sentiment, the RNN handled emotions evolving over time, and the GAN could create emotional content, especially when guided by the other models. When they are combined, the system is more reliable because each model covers a different part: CNNs pick up subtle facial cues, BERT finds meaning in text, RNNs track changes over time, and GANs generate personalized outputs. Overall, the combined approach gave higher accuracy across tasks, and the integrated design is generally more effective for emotion-aware applications.

Experimentation tools

We built our models using Python, along with TensorFlow and Keras for deep learning. To prepare the data, we used pandas and NumPy. We created graphs and examined performance results using Matplotlib and Seaborn. For the BERT models, we used the Hugging Face Transformers library. For training the GAN, we chose PyTorch because of its flexibility for generative modeling.

Limitations

  • User Dependency: The effectiveness of the system is closely tied to how actively users engage with it. If users provide little input through facial expressions or text, the system may find it difficult to generate accurate or meaningful responses, which could reduce its overall effectiveness.

  • Challenges with Age Specific Content: Although the system is designed to provide age-appropriate responses, it may not fully account for the emotional sensitivities of various age groups, especially children and older adults. As a result, the content may not always align with the user’s emotional state or specific needs.

  • Limited Emotional Data Input: The system predominantly relies on facial expressions and text to determine emotional states, which may overlook other critical emotional signals such as voice tone or body language. This limitation could hinder the accuracy with which the system can read the full range of emotional expressions.

  • Real-time Response Constraints: Although the system aims for real-time emotional feedback, the complexity of analyzing multiple types of input, including text and facial expressions could lead to processing delays. This might impact the seamless interaction experience for users, especially in high-demand environments.

  • Even though the system shows high accuracy in finding and reacting to emotions, it’s important to understand that deep learning models cannot truly understand or show human empathy. They simply imitate emotional patterns they learned from data. This brings up philosophical and ethical questions: even if these systems can pretend to be empathetic and offer comfort, they do not have self-awareness or moral thinking like real humans. Future work should view these models as helpful tools that support people, not as substitutes for real human emotional interaction.

  • The current framework mainly uses facial expressions and text to figure out emotions. While this works in many situations, it doesn’t capture the full complexity of human feelings. Non-verbal cues like voice tone, movements, and body signals also matter for emotions, but they aren’t included right now. This limits how strong the system can be in real-world, multimodal use.

  • Although the system currently generates age-specific outputs (anime for children, poems for adults, and book recommendations for seniors), this approach risks reinforcing stereotypes by assuming uniform preferences within age groups. Emotional needs and cultural contexts vary widely among individuals, and fixed mappings may not always reflect user expectations. Future iterations of the system will focus on adaptive personalization, where recommendations are refined through user feedback and preference learning rather than relying solely on age-based categories. This adjustment will ensure that emotionally intelligent AI systems remain inclusive, flexible, and sensitive to individual differences.

  • Another problem is that the model relies on Kaggle datasets for training and testing. These datasets are common benchmarks, but they may not represent the full variety of real-world emotional expressions across different cultures, languages, and groups. Because of this, the system might not perform as well when used with people outside the training data.

Ethical and cultural considerations

Introducing emotionally intelligent AI brings important ethical and cultural questions. Privacy is a major issue because gathering and analyzing facial and text data for recognizing emotions involves personal and sensitive information. It's essential to handle data securely and use privacy-friendly methods, like processing data locally or removing identifying details, to keep users' trust. Fairness is also important because deep learning models trained on small or biased data can unfairly favor some groups and perform worse for others. Additionally, how people show emotions varies a lot across cultures, so systems that ignore these differences may misunderstand users or respond inappropriately. To address these issues, we should use more diverse training data, include cultural context in how the model is built, and follow open and ethical AI practices. By putting these safeguards in place, the technology's benefits can be aligned with real-world responsibility and trust.

Conclusion

This study shows that deep learning can help make emotionally intelligent AI better. By using CNNs, RNNs, BERT, and GANs, the system can read emotions from faces and text, and also give personalized responses that fit a person’s age. The findings suggest that combining these specialized models works better and is more flexible than using just one model. Besides the technical ideas, the work highlights that ethics and culture matter when designing emotion-aware AI. Privacy, fairness, and the many ways people express emotions must stay at the heart of future progress. The system looks promising for healthcare, therapy, and mental wellness, but using it in customer service and entertainment brings both chances and risks that need careful rules. In the future, adding more ways to sense emotion—like voice tone, body movements, and body signals—will help the AI understand feelings more fully. With these improvements, emotionally intelligent AI could change how people interact with computers, as long as it is developed responsibly and respects user well-being and cultural differences. To improve generalization, future research will validate the system using a wider range of real-world datasets that capture cultural and language differences in how people express emotions. This approach will help make sure the system works well for diverse user groups and is suitable for use worldwide.

Emotionally intelligent AI could help a lot in delicate areas like health care and therapy. It can give emotional support, help doctors monitor how patients are doing, or be a virtual companion for people who feel stressed or lonely. It also has value in customer service and entertainment, where understanding feelings can make people happier with the service or experience. But using these systems in new areas also brings risks, such as manipulating people, reinforcing stereotypes, or exploiting emotional data for commercial gain. So it's important to balance new ideas with safeguards that protect users, and to be open, fair, and focused on people's well-being. Beyond technical performance, future development of emotionally intelligent AI should also take ethics into account, including privacy, fairness, and awareness of cultural differences in how people show emotions. Keeping these factors in mind will help ensure these systems are used responsibly and effectively in real-world settings. In conclusion, this study lays the foundation for developing AI systems that are more empathetic, capable of making significant contributions to mental health, emotional well-being, and fostering meaningful interactions between humans and AI.

Future Scope

  • Better Emotion Detection: In the future, the system can be upgraded by incorporating additional data sources, like voice tone and body movements, to better understand a person’s emotions. By doing so, it could recognize emotional states more precisely and offer more customized and relevant responses to each individual.

  • Cultural Sensitivity in Emotion Detection: Currently, the system might not fully account for the various ways emotions are expressed in different cultural contexts. Future developments could involve creating models that are better equipped to recognize and adapt to these cultural nuances, allowing the AI to respond more accurately to emotional cues from people of diverse backgrounds.

  • Real-time Emotion Adaptation: In the future, the system could adjust its responses in real time based on a person’s changing emotions. This would be helpful in areas like online learning, video games, or customer service, where the system can respond dynamically to keep the user engaged.

  • Ensuring Privacy in Emotion Recognition: As AI systems are increasingly used for personal interactions, safeguarding user privacy becomes crucial. Future models could prioritize processing emotional data locally on users' devices, rather than transmitting it to central servers, to ensure better privacy protection.

  • Support in Virtual Companions and Therapy: Emotionally intelligent AI shows the potential to play a key role in virtual companionship and therapeutic settings. By enhancing how the system connects with users, it could provide emotional support in areas like anxiety management and offer companionship to elderly individuals, fostering a sense of connection and well-being.

  • To fix the problem of using only facial expressions and text, future work will add more ways to read emotions, like voice tone, body movements, and body signals (for example heart rate and skin responses). Using these different signals together could give a clearer and more complete picture of how people feel, making the system work better in many real-world situations.

  • Right now, the system mainly looks at facial expressions and written input. It does not yet use other important signals like how someone speaks, body movements, or body signals such as heart rate or skin moisture. These extra signals can help us understand emotions more clearly, especially when facial expressions aren’t obvious. For example, a person’s voice tone can show sarcasm or stress even if their face stays neutral, and heart rate or skin signals can reveal feelings that aren’t seen on the outside. Adding these ways of sensing emotions in the future would help us understand people better and make the system more reliable, especially in important areas like healthcare and therapy.

Author contributions

Dr. B. V. Gokulnath - Conceptualization and Methodology; Pampana Charmitha - Software; Pampana Chathurya - Validation; D. Lavanya Satya Sri - Formal analysis; Baratam Vennela - Data curation; Dr. S. P. Siddique Ibrahim - Visualization; Dr. S. Selva Kumar - Resources.

Funding

This research did not receive any specific grant from funding agencies.

Data availability

The datasets in this study are free to use and were downloaded from Kaggle. [https://www.kaggle.com/datasets/ananthu017/emotion-detection-fer] [https://www.kaggle.com/datasets/splcher/animefacedataset] [https://www.kaggle.com/datasets/giacometti96/poems-txt] [https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k]

Declarations

Competing interests

The authors declare no competing interests.

Consent for Publication

All individuals featured in the submitted images provided written consent for their use in this publication. We confirm that all methods were carried out in accordance with relevant guidelines and regulations, and that all experimental protocols were approved by VIT-AP University. Informed consent for participation in the original data collection was obtained from all subjects. Additionally, informed consent for publication of the data and images was obtained from all subjects and/or their legal guardians.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References


