Abstract
Online education offers flexibility but often suffers from reduced learner engagement. This study developed an automated method to detect emotional engagement using an optimized Vision Transformer model with Transfer Learning. Facial data from 40 undergraduates produced a dataset of 71,185 labeled images across three engagement levels. The proposed model achieved 93.8% classification accuracy, surpassing conventional machine learning and deep learning baselines. Analysis showed engagement typically declined after six minutes of learning, with a modest rebound near session end. Pearson correlation revealed a significant positive relationship between engagement and learning outcomes, indicating that emotionally engaged learners achieved higher academic performance. These results demonstrate the feasibility of deep learning–based approaches for scalable monitoring of learner engagement and highlight the important role of emotional states in shaping online learning effectiveness. The findings provide practical insights for designing adaptive interventions to sustain attention and optimize digital learning environments.
Keywords: Emotional engagement, Online learning, Artificial intelligence in education, Educational data mining
Subject terms: Mathematics and computing, Psychology
Introduction
With the rapid progress and technological advancements in education, internet services have gained widespread adoption and implementation across major universities, as well as primary and secondary schools. Consequently, online education has undergone substantial growth. Online education offers the flexibility of learning at any time and from anywhere, breaking free from the constraints of traditional learning environments, and granting access to a vast array of educational resources. However, it also presents certain challenges. One notable challenge is the inherent separation between students and teachers in the virtual field of online learning1. This physical divide makes it arduous for teachers to gauge the level of student engagement in the learning process, a difficulty that becomes increasingly pronounced as the number of learners rises2. Compared to face-to-face instruction, the spatial and temporal detachment in online learning hinders effective communication and interaction between learners and educators, giving rise to a recurring sense of emotional disconnection. This emotional disconnect significantly impacts learners’ online educational experiences and their subsequent outcomes. Therefore, from a pedagogical standpoint, it becomes imperative for educators to automatically discern students’ emotional engagement levels during online learning, furnish timely feedback, and proactively undertake necessary measures to actively involve students in the learning journey.
According to learner engagement theory, engagement is the most effective predictor of student development, and both the level of learner engagement and learners’ emotions are closely associated with academic performance3,4. Several studies have demonstrated that learner engagement correlates with the extent of psychological investment in activities and can serve as a reliable predictor of learning outcomes5. Presently, the widely accepted definition of learner engagement, proposed by Fredricks in 2004, encompasses three dimensions: emotional engagement, behavioral engagement, and cognitive engagement. Among these dimensions, emotional engagement pertains to the degree and nature of learners’ positive or negative emotional responses to teachers, peers, school, and academics6. Learners who experience a sense of enjoyment tend to be more motivated in tackling challenging problems7. Behavioral engagement focuses on learners’ active involvement in social, academic, and extracurricular activities throughout their educational journey, emphasizing quantity over quality in terms of engagement in learning activities6. Cognitive engagement relates to the level of knowledge construction during the learning process8. Notably, Pekrun et al.’s research suggests that emotional engagement serves as a prerequisite for both cognitive and behavioral engagement. In the context of online learning, the analysis of and feedback on learners’ emotional engagement therefore play a critical role, because emotional engagement can serve as an indicator of learners’ willingness to learn, their needs, and their motivation throughout the learning process9. Experienced educators can monitor students’ engagement by observing their facial expressions during instruction and adapt their teaching strategies and content accordingly. Facial expressions serve as indicators of a person’s emotional engagement state.
Considering the limited sustained attention span of typical students, the level of emotional engagement tends to fluctuate at different stages during a class. Attention span refers to the duration of time an individual can concentrate on a task10. Wilson and Korn’s literature review highlighted that students’ attention tends to decline after approximately 10–15 min11. Several studies have investigated attention span, exploring various aspects such as the relationship between note-taking quantity and attention span1,9,12, the correlation between the amount of retained information in students’ memory and lecture duration13, and the connection between attention span and heart rate per minute3. Guo’s research indicated that students’ engagement remains high for the first 6 min when watching online learning videos, but subsequently declines rapidly2. Therefore, the implementation of a feedback system that automatically analyzes learners’ emotional engagement at different time intervals can assist teachers in summarizing their teaching plans and promptly updating their instructional strategies.
From a methodological standpoint, researchers have traditionally relied on manual coding and conventional machine learning methods to identify learners’ emotional engagement in online learning. However, manual coding of datasets is a time-consuming process and is often plagued by issues such as sample imbalance and limited sample size. Furthermore, traditional machine learning methods lack robustness, which has impeded both theoretical and practical advancements in this field. In recent years, Vision Transformer-based network models have become the state of the art in image processing and have achieved major advances in image classification. They address many of the limitations associated with traditional approaches. However, the application of Vision Transformer-based models for detecting learners’ emotional engagement in online learning has not been fully optimized or extensively explored. Moreover, Vision Transformer-based detection and feedback systems specifically tailored to the context of online learning still need to be developed. This presents challenges in the field of educational research and practice, as researchers and educators strive to leverage the potential of these state-of-the-art technologies.
Consequently, this study seeks to accomplish several objectives: (1) Assess the capability of an optimized Vision Transformer model to infer emotional engagement from facial images captured by a camera. (2) Investigate the notable variations in emotional engagement among learners at different stages of the online learning process. (3) Explore the relationship between emotional engagement and learning outcomes. These studies will offer educators and learners valuable methodological and theoretical insights, enhancing their understanding of the significance of emotional engagement in promoting effective learning.
Methods
Ethics statement
This study, “Understanding the Impact of Emotional Engagement on Learning Outcomes in Online Education: An Automated Analysis Approach,” has been approved by the Ethics Committee of the Key Laboratory of Modern Teaching Technology, Ministry of Education, under approval number L20250904-02, on September 4, 2025. This project is conducted strictly in accordance with relevant laws and regulations and complies with the ethical guidelines established by the Declaration of Helsinki. All participants signed written informed consent forms before participating in the study and were fully informed of the research objectives and the use of their facial images in academic publications. All facial images included in the study were taken with the explicit consent of the participants, who agreed to their use in scientific research and publication. For participants who did not consent to the use of identifiable images, all facial data was anonymized to ensure their privacy.
Research background and participants
Participants were 40 junior undergraduate students (mean age = 20.9) recruited at a university in western China from various majors, excluding psychology and Marxist philosophy. The participants, consisting of 20 males and 20 females, provided informed consent after a thorough explanation of the study. Recruitment for the experiment began on September 12, 2025 and ended on September 19, 2025. Informed consent forms were distributed to all 40 volunteers who participated in the experiment, and all 40 volunteers agreed and signed them. All volunteers agreed to the use of their data and facial information in this experiment.
A dedicated laboratory setting was prepared to ensure an uninterrupted environment for the participants during their involvement in the study. The lighting conditions in the laboratory were not manipulated and comprised a mix of indoor fluorescent lighting and outdoor sunlight. To capture facial video data of the participants during their online learning sessions, a computer equipped with a high-definition camera was set up in the laboratory. The participants’ online learning processes were recorded using the EV screen recording software, combined with the high-definition camera.
For the experimental phase, three approximately 10-minute instructional videos were selected from the Chinese University MOOC website. The videos were titled ‘The Psychology of Love’, ‘Innovative Thinking Behind Open Minds’, and ‘Fundamental Principles of Marxism’. All three online courses were classified as national quality courses offered by the Chinese University MOOC. Corresponding test questions were designed for each course to evaluate the participants’ learning outcomes. The test questions were carefully selected from the supplementary test materials provided after the MOOC courses. They were evaluated by two experts in the respective field of each course, who confirmed that they accurately reflect students’ learning outcomes. Each test is scored out of 10 and consists of four multiple-choice questions, two fill-in-the-blank questions, and one short-answer question. Participants were required to complete the respective test questions after watching each video to obtain their final test scores. The overall data collection process is illustrated in Fig. 1.
Fig. 1.
Flowchart for constructing a dataset of emotional engagement in online learning.
In this data collection experiment, a total of 120 segments of online learning videos, each approximately 10 min in length, were collected. Whitehill et al. compared video-based sequences and image-based methods for recognizing engagement levels and found that image-based methods had relatively higher accuracy than video-based methods. This suggests that engagement is more of a spatial concept than a spatiotemporal one14. Based on this research, we adopted an image-based approach and obtained a total of 71,185 images for further experimentation. Table 1 presents the number and proportion of images associated with each engagement level.
Table 1.
Distribution of the number of images for three levels of emotional engagement.
| Emotional engagement level | Highly engaged | Moderately engaged | Disengaged |
|---|---|---|---|
| Label | 3 | 2 | 1 |
| Number of images | 16,515 | 38,468 | 16,202 |
| Proportion of images | 23.20% | 54.04% | 22.76% |
Research design
The research design consists of seven stages to address the objectives and research questions. Here is a detailed description of each stage.
Stage 1: Facial data is obtained from online learning environments using a webcam and stored in a database.
Stage 2: The collected data undergoes a cleaning process using Camtasia Studio video editing software to remove any data that does not meet the experimental requirements. This step ensures that only valid and relevant data is retained for further analysis. (Camtasia is a software package produced by TechSmith in the United States that integrates computer screen recording and video editing. It also includes built-in features for Camtasia recorder, Camtasia Studio editor, Camtasia menu maker, Camtasia theater, Camtasia player, and Screencast).
Stage 3: Expert coders encode the emotional engagement data based on the theory of emotional engagement. The coders carefully analyze and label the collected data with the appropriate emotional engagement categories, applying their expertise and knowledge in emotional engagement research.
Stages 4 and 5: These stages involve the exploration of the first research question. The encoded emotional engagement data is utilized to train and evaluate optimized deep learning models. Through various iterations, the models are refined and adjusted to improve their performance in accurately identifying and classifying emotional engagement in the collected facial data.
Stage 6: The trained model with the best parameters is employed to identify and assign emotional engagement labels to unlabeled facial data. This allows for the automatic detection and classification of emotional engagement in previously unlabeled data.
Stage 7: Statistical analysis methods, such as Pearson correlation analysis, are utilized to address the second and third research questions. The collected data, including the labeled emotional engagement data and associated learning outcomes, are analyzed to examine the relationships between emotional engagement and learning outcomes. Statistical techniques are employed to determine the strength and significance of these relationships.
In the second stage, the cleaning process involves removing video data that does not meet the experimental requirements. Additionally, the videos are segmented into multiple video segments, ensuring that each segment contains only one category of emotional engagement. Camtasia Studio video editing software is employed for this purpose. Invalid video segments, where the learner’s face is obscured or cannot be detected, are excluded during the segmentation process. Figure 2 illustrates the process of video segmentation using Camtasia Studio software, and Fig. 3 displays the results of the video segmentation. In total, 1067 valid video segments were extracted from the initial 120 video segments for analysis in this study. We extracted one frame image every 5 frames from each video segment, excluding images that did not correspond to the engagement level of the video segment. The extracted images were then assigned the engagement level corresponding to their respective video segments. For instance, images extracted from highly engaged video segments were also assigned a highly engaged level. After data cleaning and annotation, a total of 71,185 images were obtained.
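The frame sampling and label inheritance described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and starting the sampling at frame 0 is an assumption, since the source only states that one frame was kept every 5 frames.

```python
def sample_frame_indices(n_frames: int, step: int = 5) -> list[int]:
    """Return the index of every `step`-th frame, starting at frame 0 (assumed)."""
    return list(range(0, n_frames, step))


def label_sampled_frames(segment_label: int, n_frames: int, step: int = 5):
    """Give each sampled frame the engagement label of its parent video segment."""
    return [(idx, segment_label) for idx in sample_frame_indices(n_frames, step)]


# Hypothetical 12-frame segment coded as highly engaged (label 3).
frames = label_sampled_frames(segment_label=3, n_frames=12)
print(frames)
```

Frames that visibly contradict the segment label would still need the manual exclusion step described in the text; the sketch does not cover that check.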
Fig. 2.
Video trimming using Camtasia Studio software.
Fig. 3.
Video segmentation results.
Coding scheme
In the field of online learning, learners’ facial expressions generate a substantial volume of data, which poses challenges in terms of the time required for manual coding. To overcome this methodological challenge, we have developed an optimized deep learning model. To effectively train and evaluate this model, we have devised an encoding scheme for emotional engagement levels in online learning, drawing upon the theory of learners’ emotional engagement15,16. The encoding scheme, presented in Table 2, categorizes learners’ emotional engagement into three distinct levels: highly engaged, moderately engaged, and disengaged. Each emotional engagement category is thoroughly described in Table 2, providing detailed insights into the characteristics and attributes associated with each level of emotional engagement.
Table 2.
Coding scheme for learning emotional engagement in online learning.
| Class | Head features | Eye features | Facial expression features |
|---|---|---|---|
| Highly engaged | Head upright or inclined forward | Staring at the screen, eyes unconsciously widening, increased distance between upper and lower eyelids | Surprise, joy, focus, enthusiasm, and other positive expressions. |
| Moderately engaged | Head generally upright or slightly tilted to the left or right | Line of sight positioned within the screen area, eyes open normally, no change in the distance between upper and lower eyelids | Calm, neutral and other neutral expressions |
| Disengaged | Head not upright and significant tilt to the left or right | The line of sight is positioned at the edge of the screen area or outside the screen area, eyes slightly closed or even completely closed, and the distance between the upper and lower eyelids decreases | Bored, tired, indifferent, and other negative expressions |
In the third phase, to ensure the quality and credibility of the dataset constructed for learners’ engagement, a crowdsourcing approach was employed for data annotation. Three students with academic backgrounds in educational technology were recruited as data annotators, and they underwent training to familiarize themselves with the relevant definitions of learners’ engagement states, the annotation tools, and the specific definitions of the three engagement labels. During the training, a portion of annotated data was provided for practice, and discussions and Q&A sessions were organized to address any issues or questions encountered by the annotators. Guidance and clarification were provided to resolve doubts or disagreements and to ensure a consensus among the annotators. Based on the performance of the annotators during training, they were confirmed as data annotators to participate in the annotation task.
To ensure data annotation validity and reliability, we adopted the consistency check method proposed by Kaur et al., using Kendall’s coefficient of agreement13. The annotations by all annotators showed a high level of consistency, with a Kendall’s coefficient of agreement of 0.889. This high reliability of the data annotation supports the validity and suitability of the annotated data for training and evaluating the online learning emotional engagement recognition model.
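As an illustration of how such an agreement coefficient can be computed, the sketch below implements Kendall's coefficient of concordance (W) with the usual correction for tied ranks. It assumes that the "coefficient of agreement" reported above is Kendall's W; the function names and the toy ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """ratings[j][i] is the label that rater j gave to item i."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(ranked[j][i] for j in range(m)) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied labels.
    tie_term = sum(sum(t ** 3 - t for t in Counter(r).values()) for r in ratings)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)


# Toy example: three raters label six items on the 1-3 engagement scale.
toy = [
    [3, 2, 1, 2, 3, 1],
    [3, 2, 1, 2, 2, 1],
    [3, 2, 1, 3, 3, 1],
]
print(round(kendalls_w(toy), 3))
```

With only three label values, ties are unavoidable, which is why the tie-corrected denominator is used. The function is undefined (division by zero) when every rater assigns a single label to all items.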
Automatic engagement detection based on vision transformer network and transfer learning
In the fourth and fifth stages of the study, an analysis of the encoding scheme for the emotional engagement data was conducted, revealing an issue of class imbalance within the collected dataset. Additionally, due to the smaller number of participants and a larger number of training samples per participant, there was limited diversity in the data, leading to a smaller intra-class distance and a larger inter-class distance. To address these challenges and improve the model’s performance, robustness, and generalization capabilities, the study considered the possibility of pretraining the Vision Transformer network model using the DAiSEE dataset17. The pretrained model’s weights would then be utilized as the initial weights for further training using the self-built emotional engagement dataset. By leveraging the pretrained model and incorporating it into the training process, it was anticipated that the model’s performance and generalization abilities could be enhanced, leading to improved results in recognizing and classifying emotional engagement in the online learning context. This approach aimed to address the issue of limited data diversity and enhance the overall effectiveness of the model.
The Vision Transformer network model, introduced by Dosovitskiy et al. in 2020, is a notable innovation that adapts the Transformer architecture, originally designed for natural language processing tasks, to the field of computer vision18. This model represents a self-attention-based approach to image classification. In contrast to traditional convolutional neural networks, the Vision Transformer does not employ convolutional layers but instead relies exclusively on self-attention mechanisms to extract relevant features from images. The architecture of the Vision Transformer model is visualized in Fig. 4, showcasing the arrangement of self-attention layers and feed-forward neural networks. Through the use of self-attention mechanisms, the model captures dependencies between different regions of an image, enabling it to effectively process and understand the visual information. This innovative approach has shown promising results in various computer vision tasks and has the potential to significantly impact the field of image classification.
Fig. 4.
Vision transformer model architecture diagram.
To capture more comprehensive and detailed feature information, the Vision Transformer model employs a multi-head self-attention mechanism. This mechanism involves running multiple self-attention mechanisms simultaneously, and then combining their outputs through concatenation and linear transformation to achieve the desired output dimensionality. The calculation formulas for the multi-head self-attention are provided in Eqs. (1) and (2):
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \tag{1}$$
where $Q$, $K$, and $V$ represent the query vector matrix, key vector matrix, and value vector matrix, respectively. The MultiHead function concatenates the outputs of each individual self-attention head, denoted as $\mathrm{head}_i$, for $i = 1, \ldots, h$, where $h$ is the total number of heads. $W^{O}$ is the weight matrix used for linear transformation. Each self-attention head $\mathrm{head}_i$ performs the following calculations:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{2}$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learnable weight matrices for the query, key, and value projections of the $i$th self-attention head. The $\mathrm{Attention}$ function computes the attention scores and applies them to the values to obtain the attended output. During the training process, we first pre-trained the Vision Transformer network model on the DAiSEE dataset, and then fine-tuned it on our self-built dataset of learners’ learning engagement.
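For reference, the attention function used inside each head is usually taken to be the scaled dot-product attention of the original Transformer formulation. The source does not spell this out, so the form below, including the key dimension $d_k$, is an assumption about the model rather than a statement of the authors' implementation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Here the softmax is applied row-wise, so each query position receives a set of weights over all key positions that sums to one, and dividing by $\sqrt{d_k}$ keeps the dot products in a range where the softmax does not saturate.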
Data analysis and automated feedback model
To address the first research question, we initially performed ‘10-fold cross-validation’ on the image-level data. However, this approach could lead to an overestimation of model performance because data from the same participant can appear in both the training and test sets, violating the independence of the training and testing data.
To address this issue and provide a more accurate performance evaluation, we adopted a ‘subject-based k-fold cross-validation’ approach. In this method, the 40 participants were divided into 5 folds, with each fold containing 8 participants. For each fold, the data from 8 participants were used as the test set, while the data from the remaining 32 participants were used for training. This ensured that the data from each participant was either in the training set or the test set, but never in both, effectively eliminating the risk of data leakage and ensuring a fair evaluation of the model’s performance.
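The subject-based split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: participants are identified by integer IDs 0 to 39, the fold assignment uses a seeded shuffle, and the sample tuples and file names are hypothetical. The source does not say how participants were assigned to folds.

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Shuffle participant IDs and deal them into n_folds disjoint groups."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k::n_folds] for k in range(n_folds)]


def split_by_subject(samples, test_subjects):
    """samples: list of (subject_id, image_path, label) tuples."""
    held_out = set(test_subjects)
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


# Hypothetical data: one placeholder image per participant.
samples = [(sid, f"img_{sid}.jpg", 2) for sid in range(40)]
folds = subject_folds(range(40))
for k, test_subjects in enumerate(folds, start=1):
    train, test = split_by_subject(samples, test_subjects)
    print(f"fold {k}: {len(test_subjects)} test participants, "
          f"{len(train)} train images, {len(test)} test images")
```

Because the split is made on participant IDs rather than on images, no participant can contribute images to both sides of a fold. scikit-learn's `GroupKFold` provides the same guarantee when participant IDs are passed as the groups.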
This modification was made due to the potential difficulties in recalling and organizing all 40 participants for repeated testing in real-world settings. Therefore, the ‘K-fold cross-validation by subject’ method was chosen as a more practical solution. This method ensures that the performance metrics obtained from the cross-validation process reflect the model’s true ability to generalize to new, unseen participants, rather than being inflated by repeated data from the same participants.
In this approach, the model was evaluated using performance metrics such as accuracy, precision, recall, and F1-score across all folds. This more rigorous evaluation ensures that the model’s performance is both reliable and realistic, providing a true measure of its generalization capabilities.
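A small sketch of how these four metrics can be computed for the three engagement labels is given below. The source does not state how precision, recall, and F1-score were averaged over classes, so macro-averaging (an unweighted mean over the three classes) is an assumption here, and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, labels=(1, 2, 3)):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
    }


# Hypothetical predictions for four images (3 = highly engaged, 1 = disengaged).
metrics = classification_metrics([1, 2, 3, 3], [1, 2, 3, 2])
print(metrics)
```

With the class imbalance shown in Table 1, macro-averaged and sample-weighted scores can differ noticeably, so the averaging choice matters when comparing results.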
To avoid overestimating the performance due to data leakage, we employed a ‘subject-based k-fold cross-validation’ method, where the 40 participants were divided into 5 folds, with each fold containing 8 participants. This method ensured that the data from the same participant were not used in both the training and test sets, thus providing a more accurate and unbiased estimate of the model’s performance.
To address the second research question, an automatic detection and feedback system was developed. The process flowchart of the system is illustrated in Fig. 5. This system facilitated the analysis of learner engagement recognition and variations during online learning. The analysis results were communicated to teachers in a timely manner, enabling them to better understand and respond to learners’ engagement levels. In the seventh stage, Pearson correlation analysis was employed to investigate the relationship between learners’ emotional engagement and their learning outcomes, providing insights into the impact of emotional engagement on learning effectiveness.
Fig. 5.
The automatic detection and feedback system.
Quizzes validation, reliability, and counterbalancing
Quizzes validation
To ensure the validity of the tests, we performed item analysis on each test. The purpose of item analysis was to assess whether each test item could effectively differentiate between participants with high and low performance, thereby ensuring that the tests were indeed effective in measuring the intended affective engagement and academic performance. Each item was designed according to specific learning objectives to ensure that the tests comprehensively reflected participants’ affective engagement and learning outcomes.
Reliability
The internal consistency of the quizzes was assessed using Cronbach’s α. The Psychology of Love quiz achieved α = 0.89, the Innovative Thinking quiz α = 0.82, and the Marxism quiz α = 0.77, indicating acceptable reliability across all three courses.
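For readers unfamiliar with the statistic, Cronbach's α for a quiz with k items can be computed as k/(k − 1) × (1 − sum of item variances / variance of total scores). The sketch below applies that standard formula; the score matrix is hypothetical toy data, not the study's quiz responses.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """item_scores[p][i] is the score of participant p on item i."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))          # one tuple of scores per item
    item_var = sum(variance(col) for col in columns)
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_var / total_var)


# Hypothetical scores of five participants on a three-item quiz.
toy_scores = [
    [1, 1, 2],
    [2, 2, 2],
    [2, 3, 3],
    [3, 3, 4],
    [1, 2, 1],
]
print(round(cronbach_alpha(toy_scores), 3))
```

Sample variances are used for both the items and the totals; using population variances throughout gives the same α, because the correction factor cancels in the ratio.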
Resolution of annotation disagreement and item review
To ensure annotation quality, all video segments were independently coded by two trained expert annotators. After the initial round of coding, disagreements between annotators were identified through item-level comparison. When inconsistencies occurred, the annotators first engaged in a structured discussion to review the specific video segments and justify their coding decisions. If consensus could not be reached through discussion, a third senior annotator acted as an adjudicator and made the final determination.
In addition, item-level reliability was examined during this process. For quiz items or annotation codes that demonstrated low inter-annotator agreement, the items were reviewed and clarified to remove ambiguities. No items were removed because all items achieved acceptable reliability after revision and consensus adjudication. This multi-step procedure ensured that the final annotation set met the required standard of inter-rater reliability.
Counterbalancing
To eliminate the influence of order and topic effects on the results, we implemented counterbalancing and randomization on the presentation order of the tests. Specifically, we used a counterbalancing method to randomize the order in which each participant received the test. This ensured that each participant received the test items in a different order, thus avoiding the interference of order effects on the results. Simultaneously, we also randomized the different topics involved in the test to ensure that the order of the topics did not affect the participants’ performance.
Through this randomization and counterbalancing design, we effectively controlled the confounding effects that might arise from the test order and topic order, ensuring the reliability of the results and ensuring that the relationship between affective engagement and learning outcomes is not interfered with by external order effects.
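One way to realize such a counterbalanced design is to enumerate all presentation orders of the three topics and distribute participants over them as evenly as possible. The sketch below is only an illustration of that idea: the source does not specify the exact assignment procedure, and the function name, the seeded shuffle, and the short topic names are assumptions.

```python
import itertools
import random

TOPICS = ("Psychology of Love", "Innovative Thinking", "Marxism")


def assign_orders(participant_ids, topics=TOPICS, seed=0):
    """Map each participant to one of the possible topic orders, round-robin."""
    orders = list(itertools.permutations(topics))  # 6 orders for 3 topics
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)               # randomize who gets which order
    return {pid: orders[k % len(orders)] for k, pid in enumerate(ids)}


assignment = assign_orders(range(40))
print(assignment[0])
```

With 40 participants and six possible orders the design cannot be perfectly balanced: four orders are used seven times and two are used six times.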
Results
Repeat analyses on subject-split data
Given the challenges of recalling all 40 participants and organizing repeated experiments in real-world settings, we opted for K-fold Cross-Validation by Subject, a more practical solution. In this approach, the 40 participants were divided into 5 folds, with each fold containing data from 8 participants. For each fold, the data from 8 participants were used as the test set, while the data from the remaining 32 participants were used for training. This method ensured that the model was trained on data from one set of participants and evaluated on a completely separate set, providing a more realistic and generalizable evaluation of the model’s performance.
This subject-based cross-validation approach also provides a more efficient alternative to leave-one-out cross-validation, as it allows for fewer iterations while still maintaining data independence between the training and testing phases. It ensures that the model is not biased by repeated data from the same subjects, making the evaluation more reliable and representative of the model’s real-world applicability.
In this study, the original dataset contained 71,185 images. These images were categorized into three groups based on emotional engagement: Highly Engaged (16,515 images, approximately 23.20%), Moderately Engaged (38,468 images, approximately 54.04%), and Disengaged (16,202 images, approximately 22.76%). We conducted 5-fold Cross-Validation by Subject (each time with 8 participants as the test set) to acquire and analyze accuracy, precision, and recall data across the five folds. The results for each fold include Accuracy, Precision, and Recall, and the final values are reported as the mean and standard deviation. All quiz items and annotation codes met the acceptable reliability threshold after discussion-based reconciliation, and therefore no items were removed from the analysis. The simplified data is in Table 3.
Table 3.
K-fold cross-validation by subject fold results summary (simplified data).
| Fold number | Accuracy(%) | Precision(%) | Recall(%) |
|---|---|---|---|
| Fold 1 | 88.75 | 86.56 | 85.21 |
| Fold 2 | 94.51 | 85.58 | 94.70 |
| Fold 3 | 92.32 | 93.66 | 93.32 |
| Fold 4 | 90.99 | 91.01 | 87.12 |
| Fold 5 | 86.56 | 92.08 | 86.82 |
| Mean | 90.62 | 89.78 | 89.43 |
| Std Dev | 3.09 | 3.53 | 4.27 |
The results indicate that the model’s performance was lower than in the initial image-based cross-validation, as expected, due to the more stringent evaluation process. The mean accuracy across all 5 folds was 90.62%, with a standard deviation of 3.09%. The precision and recall metrics were also consistent, with average values of 89.78% and 89.43%, respectively. These results indicate that the model maintains a stable performance when evaluated using independent subject data, ensuring that the reported performance metrics are not inflated.
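The summary rows of Table 3 can be checked directly from the per-fold values. The short script below recomputes the mean and the sample standard deviation (n − 1 in the denominator) for each metric; that the authors used the sample rather than the population standard deviation is an inference from the reported numbers.

```python
from statistics import mean, stdev

# Per-fold values copied from Table 3 (in percent).
folds = {
    "accuracy": [88.75, 94.51, 92.32, 90.99, 86.56],
    "precision": [86.56, 85.58, 93.66, 91.01, 92.08],
    "recall": [85.21, 94.70, 93.32, 87.12, 86.82],
}

# stdev() is the sample standard deviation.
summary = {name: (mean(vals), stdev(vals)) for name, vals in folds.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean = {m:.3f}, sd = {s:.3f}")
```

The recomputed values agree with the Mean and Std Dev rows of Table 3 to within rounding in the last reported digit.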
To ensure full transparency, Table 4 reports the Cronbach’s α values for each quiz: the Psychology of Love quiz demonstrated an internal consistency of α = 0.89 (the highest of the three), the Innovative Thinking quiz showed α = 0.82, and the Marxism quiz yielded α = 0.77 (indicating acceptable internal consistency). All three values fall within the acceptable reliability range, indicating that the quiz items consistently measured the intended learning constructs.
Table 4.
Cronbach’s α values for each quiz.
| Course | Cronbach’s α |
|---|---|
| Psychology of love | 0.89 |
| Innovative thinking | 0.82 |
| Marxism | 0.77 |
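For readers who want to see how an internal-consistency value of this kind is obtained, the sketch below implements the standard Cronbach’s α formula, α = k/(k − 1) · (1 − Σ item variances / variance of total scores). The item responses are hypothetical and are not the study’s quiz data; the function name is illustrative.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: list of items, each a list of scores (one per respondent)."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Total score per respondent, summed across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical responses: 3 quiz items answered by 5 learners (1 = correct).
example_items = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
]
alpha = cronbach_alpha(example_items)
print(f"alpha = {alpha:.2f}")
```

Items that rank respondents identically drive α toward 1, while unrelated items drive it toward 0.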
By using subject-wise K-fold cross-validation, we obtained a more reliable estimate of the model’s generalization ability, which is essential for its practical application in real-world scenarios.
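The key property of cross-validation by subject is that every image of a given participant falls on the same side of the split. The following is a minimal standard-library sketch of such a split for 40 participants and 5 folds. The helper names, the random seed, and the synthetic image list are assumptions for illustration, not the authors’ implementation.

```python
import random

def make_subject_folds(subject_ids, n_folds=5, seed=0):
    """Partition the unique subject IDs into n_folds disjoint groups."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::n_folds] for i in range(n_folds)]

def split_by_subject(subject_ids, test_subjects):
    """Return (train_idx, test_idx) so that no subject appears in both."""
    held_out = set(test_subjects)
    train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train_idx, test_idx

# Hypothetical image-level subject labels: 40 participants, 10 images each.
image_subjects = [pid for pid in range(40) for _ in range(10)]
folds = make_subject_folds(image_subjects)

for k, test_subjects in enumerate(folds, start=1):
    train_idx, test_idx = split_by_subject(image_subjects, test_subjects)
    train_subjects = {image_subjects[i] for i in train_idx}
    assert train_subjects.isdisjoint(test_subjects)  # no identity leakage
    print(f"fold {k}: {len(test_subjects)} test subjects, {len(test_idx)} test images")
```

An image-level split, by contrast, would let near-identical frames of one learner appear in both training and test sets, which is why it tends to report higher scores.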
Correlation analysis
Performance metrics
The analysis showed a significant positive correlation between emotional engagement and learning outcomes (measured by quiz scores). The mean accuracy of the model across all 5 folds was 90.62%, with a standard deviation of 3.09%. The mean precision and recall were 89.78% and 89.43%, respectively. These results suggest that the model predicted emotional engagement reliably for participants not seen during training, which supports using its output in the correlation analysis.
Correlation between engagement and scores
The relationship between emotional engagement and learning outcomes was analyzed using Pearson’s correlation. The correlation coefficient was 0.68, indicating a moderate to strong positive relationship between engagement levels and quiz performance. This suggests that students who were more emotionally engaged in the learning process tended to perform better in the quizzes. However, it is important to emphasize that this is a correlational analysis and not a causal one, so no claims can be made about the direction of influence. The observed relationship should not be interpreted as one variable causing the other.
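As an illustration of the statistic used here, the sketch below computes Pearson’s r directly from its definition (covariance divided by the product of the standard deviations). The engagement and quiz values are hypothetical placeholders and do not come from the study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical per-learner values (not the study's data).
engagement = [1.6, 1.9, 2.1, 2.4, 2.6, 2.8]   # mean engagement level
quiz_score = [4, 6, 5, 7, 8, 9]               # quiz score

r = pearson_r(engagement, quiz_score)
print(f"r = {r:.3f}")
```

The coefficient is symmetric in its two arguments, which is one reason it cannot by itself indicate the direction of influence.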
Controlling for order and topic effects
To ensure the validity of the results, potential order effects and topic effects were controlled through the use of randomization and counterbalancing. The randomization of quiz order ensured that no specific sequence of questions influenced the participants’ responses. The counterbalancing technique further minimized any bias from the sequence of topics covered in the quizzes. As a result, the observed correlations between engagement and quiz performance are not confounded by these extraneous factors.
Optimized vision transformer model identifies learners’ emotional engagement in online learning
The purpose of this experiment was to evaluate the performance of the optimized Vision Transformer model and Transfer Learning model in detecting emotional engagement. Firstly, the Vision Transformer network model was pre-trained using the DAiSEE dataset to enhance its feature representation and generalization ability. To assess the impact of Transfer Learning on model performance, comparative experiments were conducted to evaluate the accuracy of models with and without pre-training. The term ‘without pre-training’ refers to the Vision Transformer network model trained directly on the self-built learning engagement dataset without prior pre-training on the DAiSEE dataset. The results of the comparative experiments are presented in Table 5.
Table 5.
Comparison of experimental results (%) before and after transfer learning.
| Model | Accuracy | Macro-recall | Macro-precision | Macro-F1 |
|---|---|---|---|---|
| Vision transformer | 91.79 ± 1.54 | 91.48 ± 1.72 | 93.04 ± 1.46 | 91.99 ± 1.55 |
| Vision transformer + transfer learning | 93.82 ± 1.20 | 93.78 ± 1.04 | 94.26 ± 0.80 | 93.95 ± 1.12 |
Table 5 presents a comparison of model performance before and after the application of Transfer Learning. As shown in the table, incorporating Transfer Learning leads to consistent improvements across all evaluated metrics, including accuracy, precision, recall, and F1-score. These results indicate that fine-tuning the pre-trained model on the target dataset effectively enhances classification performance and improves the model’s ability to generalize across participants.
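The macro-averaged metrics in Table 5 weight the three engagement classes equally, regardless of how many images each class contains. The sketch below shows one common way to compute them from a confusion matrix, taking macro-F1 as the unweighted mean of per-class F1 scores. This definition is an assumption, since the text does not state which variant was used. The confusion matrix is hypothetical.

```python
def macro_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }

# Hypothetical 3-class confusion matrix
# (rows/columns: highly engaged, moderately engaged, disengaged).
cm = [
    [8, 2, 0],
    [1, 8, 1],
    [0, 2, 8],
]
metrics = macro_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the Moderately Engaged class makes up more than half of the dataset, macro averaging keeps that class from dominating the reported scores.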
Table 6 compares the classification performance of the proposed optimized Vision Transformer + Transfer Learning model with several baseline models on the self-built dataset for the task of emotional engagement detection. The comparison methods were Gabor + SVM (ref. 42), decision tree (ref. 41), ResNet + TCN (ref. 16), and LDP-KPCA-DBN (ref. 13). The algorithm is described in Sect. 2.2. The results demonstrate that the vision transformer consistently outperforms the baseline architectures across all performance metrics, which highlights the effectiveness of the vision transformer architecture for modeling emotional engagement in online learning scenarios.
Table 6.
Classification results (%) of emotional engagement.
| Class | Gabor + SVM | Decision tree | ResNet + TCN | LDP-KPCA-DBN | Vision Transformer + Transfer Learning |
|---|---|---|---|---|---|
| Highly engaged | 77.52 ± 1.61 | 89.22 ± 1.13 | 89.93 ± 1.72 | 95.52 ± 1.68 | 95.97 ± 1.01 |
| Moderately engaged | 73.65 ± 1.82 | 73.17 ± 1.46 | 82.24 ± 1.16 | 85.83 ± 1.41 | 92.46 ± 0.79 |
| Disengaged | 82.19 ± 1.81 | 83.15 ± 1.72 | 84.02 ± 1.19 | 83.39 ± 1.02 | 93.03 ± 1.29 |
| Mean | 77.78 ± 1.52 | 81.84 ± 0.94 | 85.40 ± 1.29 | 88.25 ± 1.13 | 93.82 ± 1.20 |
From Table 6, it can be observed that the Vision Transformer–based approach consistently outperforms other baseline models across all engagement categories.
Students’ engagement varies during online learning
Figure 6 illustrates the overall automatic recognition results of learners’ engagement. The results of the self-built dataset of emotional engagement reveal interesting patterns throughout the online course. It can be observed from the figure that the majority of learners’ engagement is categorized as ‘moderate engagement’ throughout the entire duration of the course. However, as the learning session progresses, there is a gradual decline in learners’ engagement. The proportion of ‘high engagement’ and ‘moderate engagement’ decreases, while the proportion of ‘low engagement’ increases. Notably, around the 6-minute mark, there is a more pronounced increase in the proportion of ‘low engagement’, indicating a decline in learners’ overall engagement during that period. Interestingly, a slight improvement in learners’ engagement was observed towards the end of the course, despite the course nearing its conclusion. This suggests a potential rebound or revitalization of engagement towards the end of the learning session. These insights provide essential information about the fluctuations in learners’ engagement levels during online learning and highlight specific time periods when learners may experience changes in their level of engagement. Understanding these patterns can be valuable for educators and instructional designers to optimize learning experiences and implement interventions to sustain and enhance learner engagement throughout the course.
Fig. 6.
The automatic detection results of all learners.
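A time course like the one summarized in Fig. 6 can be derived from frame-level predictions by binning them per minute and computing the share of each engagement level in every bin. The sketch below shows this aggregation under stated assumptions: the input format (timestamp in seconds plus a predicted level), the one-minute bin width, and the sample frames are all hypothetical and are not taken from the study’s pipeline.

```python
from collections import Counter, defaultdict

LEVELS = ("high", "moderate", "low")

def engagement_by_minute(predictions):
    """predictions: iterable of (timestamp_seconds, level) pairs pooled over learners.

    Returns {minute: {level: proportion}} with proportions summing to 1 per minute.
    """
    counts = defaultdict(Counter)
    for t, level in predictions:
        counts[int(t // 60)][level] += 1
    timeline = {}
    for minute in sorted(counts):
        total = sum(counts[minute].values())
        timeline[minute] = {lvl: counts[minute][lvl] / total for lvl in LEVELS}
    return timeline

# Hypothetical frame-level predictions (seconds from session start, predicted level).
frames = [
    (10, "high"), (30, "moderate"), (50, "moderate"),
    (70, "moderate"), (100, "low"), (110, "low"), (115, "moderate"),
]
timeline = engagement_by_minute(frames)
for minute, shares in timeline.items():
    print(minute, {lvl: round(p, 2) for lvl, p in shares.items()})
```

Plotting the three proportions against the minute index would make a rise in the low-engagement share at a particular point in the session directly visible.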
Figure 7 presents the overall manual annotation results of learners’ engagement during online learning. The experimental results demonstrate the effectiveness of our proposed online learning sentiment recognition model in accurately identifying and categorizing learners’ engagement levels. The manual annotation results are generally consistent with the automated detection results, indicating that our algorithm is effective in capturing the overall patterns of learners’ changes throughout the online learning process. The alignment between manual annotations and automated detection results allows for a deeper analysis of consistent patterns in learners’ engagement. This highlights the robustness and reliability of our algorithm in providing timely and accurate understanding of learners’ engagement states. Moreover, the effectiveness of our algorithm becomes particularly evident when comparing it to the automated detection and feedback processes. Our algorithm provides a comprehensive and nuanced understanding of learners’ engagement, surpassing the limitations of automated detection alone. It enables us to gain valuable insights into the dynamics of learners’ engagement, facilitating timely and effective interventions to enhance the learning experience. Overall, the experimental results from Fig. 7 affirm the effectiveness of our online learning sentiment recognition model in identifying learners’ engagement during online teaching. This supports the notion that our algorithm is a valuable tool for gaining a deeper understanding of student states and optimizing the online learning environment.
Fig. 7.
The manual annotation results of all learners.
The relationship between emotional engagement and learning outcomes
Table 7 presents the statistical analysis of the automatic detection results, the manual annotations, the average of the manual annotations, and the test results. The metrics reported are the mean and standard deviation (SD). For the automatic recognition results, the mean values for the three online courses are 2.470, 1.931, and 1.651, respectively, indicating the average emotional engagement level of all learners as determined by the online learning emotion recognition model. The SD represents the variability in the recognition results. The manual annotations by three annotators provide an additional perspective on learners’ emotional engagement, and their average combines the three individual annotations into a more comprehensive assessment. Lastly, the test scores represent the learners’ performance on the quiz questions designed to assess their learning effectiveness; their mean and SD values are also provided. Together, these results show the agreement between the automatic recognition results and the manual annotations alongside the learners’ learning outcomes. They provide a quantitative assessment of emotional engagement levels, helping to validate the online learning emotion recognition model and to evaluate the learners’ understanding and retention of the video content.
Table 7.
Statistical analysis of emotional engagement and learning outcomes.
| Course ID | Statistic | Annotator 1 | Annotator 2 | Annotator 3 | Average of manually annotated results | Automatic detection | Test score |
|---|---|---|---|---|---|---|---|
| 1 | Mean | 2.782 | 2.224 | 2.565 | 2.523 | 2.470 | 8.100 |
| | SD | 0.323 | 0.329 | 0.410 | 0.377 | 0.323 | 1.370 |
| | N | 40 | 40 | 40 | 120 | 40 | 40 |
| 2 | Mean | 2.10 | 1.98 | 1.89 | 1.986 | 1.931 | 6.300 |
| | SD | 0.201 | 0.216 | 0.260 | 0.226 | 0.245 | 0.823 |
| | N | 40 | 40 | 40 | 120 | 40 | 40 |
| 3 | Mean | 1.73 | 1.80 | 1.63 | 1.716 | 1.651 | 4.400 |
| | SD | 0.122 | 0.174 | 0.158 | 0.175 | 0.121 | 0.516 |
| | N | 40 | 40 | 40 | 120 | 40 | 40 |
Table 8 presents the results of the Pearson correlation analysis conducted to examine the relationship between learners’ emotional engagement and their learning outcomes. The variables analyzed are emotional engagement, measured by manual annotation and by automatic detection, and learning outcomes, measured by test score. For both automatic detection and manual annotation, engagement was significantly related to learning outcomes. Specifically, the correlation coefficients between automatic detection and test score were r = 0.860**, 0.664*, and 0.707* for Courses 1, 2, and 3, respectively. The correlation coefficients between manual annotation and test score were r = 0.799**, 0.657*, and 0.636*. The correlation between emotional engagement and learning outcomes is therefore statistically significant in all three courses. These results support the hypothesis that learners’ emotional engagement is positively associated with their learning outcomes: learners who are more emotionally engaged tend to achieve better results, which highlights the relevance of emotional engagement in the context of online learning.
Table 8.
Pearson correlation analysis between test score, automatic detection, and manual annotation.
| Course ID | Variable | Manual annotation | Automatic detection | Test score |
|---|---|---|---|---|
| 1 | Manual annotation | 1 | | |
| | Automatic detection | 0.869** | 1 | |
| | Test score | 0.799** | 0.860** | 1 |
| 2 | Manual annotation | 1 | | |
| | Automatic detection | 0.876** | 1 | |
| | Test score | 0.657* | 0.664* | 1 |
| 3 | Manual annotation | 1 | | |
| | Automatic detection | 0.887** | 1 | |
| | Test score | 0.636* | 0.707* | 1 |
N = 40, *p < 0.05, **p < 0.01, ***p < 0.001.
Discussion
This study investigated the relationship between emotional engagement and learning outcomes in online courses using automated facial-expression analysis supported by Vision Transformer models. The findings demonstrate a clear positive association between learners’ emotional engagement and their quiz performance, suggesting that emotional engagement provides meaningful insight into how learners interact with online content. By applying Transfer Learning, the model achieved stable performance across subject-based validation, indicating that the proposed approach can generalize across different individuals despite limited data quantities.
Importantly, the temporal analysis revealed a general decline in emotional engagement over the duration of the courses, with a slight recovery toward the end. This pattern aligns with prior studies suggesting that sustained participation in online environments often leads to reduced attentional and emotional investment. However, the present work extends this understanding by quantifying engagement using automated, fine-grained behavioral indicators rather than self-report measures.
At the same time, several limitations must be acknowledged. The three MOOCs included in the study cover heterogeneous content domains, which may influence how learners emotionally respond to different topics. This content heterogeneity introduces potential confounding effects, especially when interpreting cross-course trends, and limits generalization. Although subject-based cross-validation addressed participant-level variance, topic-level variance was not fully controlled. Future work should therefore incorporate more homogeneous course materials or conduct course-level stratified analyses.
Despite these limitations, the study provides evidence that automated engagement detection, combined with advanced deep learning models, can offer reliable indicators for understanding online learning behaviors. These findings support the broader use of affective analytics to enhance adaptive learning systems and improve instructional design in digital education environments.
Conclusions
This study demonstrates that emotional engagement, detected through automated facial-expression analysis, is positively associated with learning performance in online courses. The Vision Transformer model, enhanced through Transfer Learning, provided reliable recognition under subject-based validation and supported the analysis of engagement trends.
The results highlight the potential of integrating affective analytics into online learning platforms to better monitor learners’ engagement and support personalized interventions. Nonetheless, the heterogeneity of course topics and the limited number of participants impose constraints on generalizability. Future research should employ more standardized course content, larger samples, and longitudinal designs to further validate the proposed approach.
Author contributions
Guanyu Chen is responsible for article writing and revision, as well as communication work. Guangxin Han is responsible for data calculation and processing, as well as icon creation and modification. Juan Niu has provided academic research and theoretical support for this study. Juhou He is the administrator of this project and provides guidance throughout the research process.
Funding
National Natural Science Foundation of China (No. 62177032), “Research on the Autonomous Training and Evaluation Model for Pre-service Teachers’ Classroom Teaching Expression Competence.”
Data availability
The datasets generated and analyzed during the current study are not publicly available due to ethical restrictions related to identifiable facial data, but are available from the corresponding author upon reasonable request and subject to approval by the institutional ethics committee.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Venton, B. J. & Pompano, R. R. Strategies for enhancing remote student engagement through active learning. Anal. Bioanal. Chem. 413 (6), 1507–1512 (2021).
- 2. Pabba, C. & Kumar, P. An intelligent system for monitoring students’ engagement in large classroom teaching through facial expression recognition. Expert Syst. 39 (1) (2022).
- 3. Sümer, Ö. et al. Multimodal Engagement Analysis from Facial Videos in the Classroom (IEEE Transactions on Affective Computing, 2021).
- 4. Baker, R. S. et al. Better to be frustrated than bored: the incidence, persistence, and impact of learners’ cognitive–affective states during interactions with three different computer-based learning environments. Int. J. Hum. Comput. Stud. 68 (4), 223–241 (2010).
- 5. Bian, C. et al. Spontaneous facial expression database for academic emotion inference in online learning. IET Comput. Vision 13 (3), 329–337 (2019).
- 6. Pekrun, R. et al. Achievement emotions and academic performance: longitudinal models of reciprocal effects. Child Dev. 88 (5), 1653–1670 (2017).
- 7. Whitehill, J. et al. The faces of engagement: automatic recognition of student engagement from facial expressions. IEEE Trans. Affect. Comput. 5 (1), 86–98 (2014).
- 8. Fredricks, J. A., Blumenfeld, P. C. & Paris, A. H. School engagement: potential of the concept, state of the evidence. Rev. Educ. Res. 74 (1), 59–109 (2004).
- 9. Bosch, N., Chen, Y. & D’Mello, S. It’s written on your face: detecting affective states from facial expressions while learning computer programming. Intelligent Tutoring Systems: 12th International Conference, ITS (2014).
- 10. D’Mello, S. K., Craig, S. D. & Graesser, A. C. Multimethod assessment of affective experience and expression during deep learning. Int. J. Learn. Technol. 4 (3–4), 165–187 (2009).
- 11. Means, B., Toyama, Y., Murphy, R., Bakia, M. & Jones, K. Evaluation of Evidence-Based Practices in Online Learning: A Meta-Analysis and Review of Online Learning Studies. Project Report. Centre for Learning Technology. Assoc. Learn. Technol. (2009).
- 12. Kaur, A. et al. Prediction and localization of student engagement in the wild. 2018 Digital Image Computing: Techniques and Applications (DICTA). IEEE, 1–8 (2018).
- 13. Kamath, A., Biswas, A. & Balasubramanian, V. A crowdsourced approach to student engagement recognition in e-learning environments. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–9 (2016).
- 14. Mukhopadhyay, M. et al. Facial emotion detection to assess learner’s state of mind in an online learning system. Proceedings of the 5th International Conference on Intelligent Information Technology (2020).
- 15. Gupta, A. et al. DAiSEE: towards user engagement recognition in the wild (2016).
- 16. Alyuz, N. et al. Towards an emotional engagement model: can affective states of a learner be automatically detected in a 1:1 learning scenario? 24th ACM Conference on User Modeling, Adaptation and Personalization (UMAP) (2016).
- 17. D’Mello, S. K. & Graesser, A. Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Model. User-Adapt. Interact. 20, 147–187 (2010).
- 18. Skinner, E. A. & Belmont, M. J. Motivation in the classroom: reciprocal effects of teacher behavior and student engagement across the school year. J. Educ. Psychol. 85 (4), 571–581 (1993).
- 19. Schmieder, A. A glossary of educational reform. J. Teacher Educ. 24 (1), 55–62 (1973).
- 20. Fisher, C. W. et al. Teaching behaviors, academic learning time, and student achievement: an overview. J. Classr. Interact. 17 (1), 2–15 (1981).
- 22. Connell, J. P. & Wellborn, J. G. Competence, autonomy, and relatedness: a motivational analysis of self-system processes. J. Personal. Soc. Psychol. (65), 43–77 (1991).
- 25. Kahu, E. R. & Nelson, K. Student Engagement in the Educational Interface: Understanding the Mechanisms of Student Success Vol. 37, 58–71 (Higher Education Research & Development, 2018).
- 27. Aslan, S. et al. Human expert labeling process (HELP): towards a reliable higher-order user state labeling process and tool to assess student engagement. Educ. Technol., 53–59 (2017).
- 29. Kuh, G. D. The National Survey of Student Engagement: conceptual and empirical foundations. New Dir. Institutional Res. 2009 (141), 5–20 (2010).
- 30. Whitehill, J. et al. The faces of engagement: automatic recognition of student engagement from facial expressions. IEEE Trans. Affect. Comput. (1), 86–98 (2014).