Abstract
In the evolving landscape of intelligent education systems, there is an urgent need to develop adaptive, transparent, and robust classroom analytics solutions that align with the priorities of human-centered artificial intelligence and multimodal interaction, as emphasized by the scope of this special issue. Existing physical activity evaluation tools for educational settings often lack scalability, context sensitivity, and the capacity to extract meaningful temporal patterns from high-dimensional behavior streams. Traditional methods tend to oversimplify the complexities of pedagogical dynamics, resulting in feedback that is static, ambiguous, or divorced from instructional intent. These challenges are met through an integrated solution that combines semantic alignment with hierarchical attention—where spatial-temporal patterns are first captured through layered attention modeling, and then adaptively contextualized to align with instructional objectives. Our system, built upon the KINEVAL architecture and the Pedagogical Contextualization Strategy (PCS), fuses motion trajectory embeddings, instructional state encoding, and environment-aware modulation to generate structured, interpretable evaluations of student performance. The inclusion of attention-based dilated GRUs, transformer-based pedagogical modeling, and peer-aware regularization not only enhances robustness and interpretability but also enables cross-domain generalization across diverse school contexts. Experimental validation shows substantial improvements over conventional baselines in accuracy, fairness, and alignment with expert annotations. This study contributes a scalable and pedagogically informed approach to classroom behavior analysis, directly supporting the special issue’s themes of intelligent sensing, adaptive learning, and multimodal system design.
Keywords: Multimodal evaluation, Attention mechanism, Classroom analytics, Pedagogical adaptation, Intelligent systems
Subject terms: Engineering, Mathematics and computing
Introduction
Intelligent evaluation of physical education (PE) classrooms in primary and secondary schools has become increasingly essential due to the growing emphasis on holistic education and students’ physical well-being1. Traditional evaluation methods rely heavily on subjective teacher assessments, which not only lack standardization but also fail to capture detailed motion data during activities. With the advent of AI technologies, there is a compelling need to develop more objective, real-time, and data-driven evaluation systems2. Not only can such systems enhance teaching quality and student engagement, but they can also provide quantifiable feedback on student performance and technique. Moreover, an intelligent model based on posture estimation and motion recognition can support differentiated instruction by identifying individual learning needs and movement deficiencies, ultimately fostering a more inclusive and scientifically guided PE curriculum3.
Early attempts at intelligent motion assessment mainly focused on constructing rule-based frameworks that relied on predefined templates to interpret movement patterns. These systems could provide structured feedback aligned with expert knowledge but were inherently rigid and struggled to cope with the diversity of actions observed in dynamic classroom environments4. As a result, while they offered a certain degree of interpretability and clarity, they were difficult to scale and lacked the flexibility to adapt to variations in students’ performance, leading to gaps in accuracy and completeness during real-time classroom use5.
To alleviate these constraints, researchers began incorporating statistical modeling techniques that could capture more nuanced relationships between movement features and performance outcomes. By learning from annotated motion sequences collected through video or wearable devices, these methods improved recognition precision and reduced dependence on manually defined rules6. However, they still required careful feature design and extensive data preparation, and their ability to generalize across different body types, age groups, and learning contexts remained limited, especially when faced with the noisy and unstructured nature of classroom activities7.
More recently, advances in hierarchical representation learning have enabled systems to automatically extract spatiotemporal patterns from raw visual and skeletal data without relying on hand-crafted features8. By leveraging layered computational models, these approaches can identify subtle posture deviations, complex action transitions, and individualized movement styles with greater robustness. Combined with pose estimation algorithms, they allow fine-grained evaluation of performance at the joint and limb level9. Nevertheless, the deployment of such models in real-world PE classrooms still faces challenges such as the need for large annotated datasets, significant computational resources, and the interpretability of their decision-making processes, which are crucial for educational stakeholders10.
Recent developments in 3D graph-based deep learning and point cloud techniques have provided new insights into motion segmentation and multimodal interaction in human-centered systems. For instance, Xing et al. proposed a 3D graph-based hand segmentation approach combining deep learning and laser point cloud data to enhance visual interaction and intelligent rehabilitation11. In a related study, they further explored intelligent rehabilitation in aging populations, leveraging point cloud representations and deep learning for hand function assessment and human-machine collaboration12. These studies underscore the potential of multimodal, context-aware modeling strategies in applications that require fine-grained motion understanding and adaptive feedback, such as physical education evaluation in real-world classroom settings.
Despite these advances, the core limitations of existing methods remain fundamentally unresolved. Rule-based systems are inherently brittle and incapable of adapting to heterogeneous movement patterns or instructional variations. Statistical models depend heavily on handcrafted features and provide limited robustness in noisy, multi-person classroom settings. Deep learning approaches, while powerful in representation learning, typically ignore pedagogical intent and environmental context, resulting in evaluations that are accurate but not instruction-aware, interpretable, or educationally meaningful. Moreover, most prior methods treat motion, pedagogy, and context as isolated modalities, preventing them from capturing the complex interaction between teacher guidance, student behavior, and classroom conditions. These limitations highlight the need for a unified, pedagogically grounded, and context-sensitive evaluation framework.
In contrast to previous approaches that rely solely on posture estimation or isolated motion features, our proposed model introduces a fundamental innovation by fusing multimodal data—kinematic motion trajectories, pedagogical instructional states, and environmental context signals—within a unified evaluative framework. This design allows for a more comprehensive and interpretable understanding of classroom dynamics that goes beyond surface-level motion recognition. By leveraging this multimodal fusion, our KINEVAL architecture can distinguish between technically similar movements with different instructional intents, adapt to diverse classroom settings, and generate context-sensitive evaluation feedback. Such integration significantly enhances robustness, fairness, and real-time responsiveness, effectively addressing the subjectivity, lack of scalability, and ambiguity that characterize traditional evaluation systems. This multimodal and pedagogically informed perspective represents a unique contribution to the field of intelligent education analytics.
Based on the aforementioned limitations, we propose an intelligent evaluation model that integrates real-time pose estimation with efficient motion recognition tailored for primary and secondary PE classrooms. Our method aims to bridge the gap between accuracy, scalability, and interpretability. By focusing on lightweight pose estimation algorithms and optimized action classification networks, the model is designed for real-world classroom deployment. It also incorporates feedback mechanisms to provide actionable insights for students and teachers alike. In doing so, the approach aligns with pedagogical goals and enhances the scientific foundation of PE instruction. Importantly, the model supports multi-scenario use, from individual assessment to class-level analytics, making it a versatile tool for modern education.
- The model introduces a modular pipeline combining pose estimation with a domain-adaptive action recognition framework, ensuring efficient performance in diverse classroom environments.
- It supports high adaptability across multiple scenarios, such as various sports activities and class sizes, with robust recognition accuracy and low latency suitable for real-time feedback.
- Experimental evaluations on benchmark and custom datasets demonstrate superior accuracy and generalization compared to existing methods, confirming its effectiveness in practical PE settings.
Related work
Posture estimation techniques in educational settings
The development and refinement of posture estimation techniques have become central to enhancing intelligent evaluation models tailored for physical education (PE) classrooms in primary and secondary schools. Contemporary advancements in computer vision have introduced deep convolutional neural networks (CNNs) capable of detecting and tracking human skeletal joints with high precision. Frameworks such as OpenPose, HRNet, and MediaPipe utilize multi-stage heatmap representations or part affinity fields to localize keypoints such as shoulders, elbows, hips, knees, and ankles13. These frameworks form the backbone of posture estimation systems deployed in real-world environments where classroom conditions present unique challenges including variable lighting, occlusion from other students, and limited camera perspectives. For instance, multi-camera setups with overlapping fields of view can mitigate occlusion yet require sophisticated calibration and synchronization accuracy to reconstruct three-dimensional (3D) poses from two-dimensional (2D) images14. Calibration approaches often leverage the use of checkerboard patterns or structure-from-motion algorithms to estimate internal and external parameters of cameras while ensuring that spatial alignment remains robust across sessions. Moreover, domain adaptation techniques are employed to cope with the distribution shift arising from annotating teacher-led demonstration videos versus student-generated movement data15. Semi-supervised learning (SSL) and self-supervised pre-training through video-based reconstruction or predictive modeling enable the posture estimation models to generalize across differing classroom environments and individual variations in body shape and attire. 
Low-resolution camera constraints in budget-limited school settings impose additional performance hurdles; in response, knowledge distillation from high-resolution teacher models to lightweight student architectures ensures that real-time inference can be performed at edge devices without sacrificing accuracy16. To evaluate performance in PE classroom contexts, benchmarks are designed that include standardised physical education activities such as standing long jump, push-ups, squat, and shuttle run. Metrics cover mean per-joint position error (MPJPE), percentage of correct keypoints (PCK), as well as application-specific measures such as alignment angles and posture consistency over time. Ultimately, the convergence of multi-view geometry, domain adaptation, and efficient model design forms the basis upon which posture estimation engines can reliably integrate into intelligent PE evaluation frameworks that accurately assess student posture, alignment, and movement execution17.
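The two pose metrics named above, MPJPE and PCK, are straightforward to compute from predicted and ground-truth keypoint arrays. The following is a minimal NumPy sketch; the array shapes and toy joint values are illustrative, not taken from any benchmark:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth joints (J x D arrays)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pck(pred, gt, threshold):
    """Percentage of correct keypoints: fraction of joints whose
    prediction lies within `threshold` of the ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dists <= threshold))

# Toy example: 3 joints in 2D, one prediction off by 0.5 units.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pred = np.array([[0.0, 0.0], [1.0, 0.5], [0.0, 1.0]])
```

In practice the threshold for PCK is usually normalized by a body-scale reference (e.g. head or torso size) so that scores are comparable across students of different heights.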
Motion recognition and skill assessment models
Building upon robust posture estimation pipelines, motion recognition techniques provide semantic interpretation of student movements in PE classrooms, allowing for the automated assessment of exercise quality and adherence to predefined standards. The core challenge lies in mapping continuous sequences of skeletal joint coordinates to discrete movement primitives, gesture classes, or exercise quality ratings. Recurrent neural network (RNN) architectures such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks have been widely adopted to capture temporal dependencies and sequential patterns inherent in human movement18. More recently, temporal convolutional networks (TCNs) and Transformer-based models extend the capability to learn long-range dependencies with parallelized attention mechanisms. Model architectures are often trained on labeled datasets containing annotated repetition counts, movement phases, and correctness labels, with cross-entropy or regression losses reflecting classification or scoring tasks respectively. For more nuanced skill assessment, continuous quality scoring models are trained using ordinal regression or pairwise ranking approaches to differentiate between proficient, intermediate, and novice levels19. To create training sets, expert educators label video demonstrations to specify correct movement trajectories and common error patterns (e.g., knee valgus in squats, spinal misalignment in sit-ups). Error augmentation techniques generate synthetic variations by slightly perturbing joint trajectories to simulate realistic deviations that students may commit. Such augmentation enhances the model’s ability to generalize error detection. Multi-task learning frameworks unify exercise-type classification, error detection, repetition counting, and quality scoring within a single architecture20.
Learning tasks share a common backbone that extracts spatiotemporal features from joint trajectories, followed by task-specific heads optimized in parallel. This approach provides computational efficiency while leveraging mutual inductive bias across correlated tasks. In real-time deployment, models infer on skeleton streams capturing 2D or 3D joint coordinates and output frame-level exercise segmentation, repetition count, and correctness score. Feedback mechanisms translate the model outputs into actionable guidance — for example, recommending “lift your chest more” when detecting forward trunk lean in push-ups, or “avoid knee collapse” during squats21. Evaluation metrics for motion recognition systems include accuracy and F1-score for classification tasks, mean absolute error (MAE) for repetition counting, and correlation with expert ratings for quality scoring. Real-world classroom trials report that such models can achieve over 90% accuracy in movement classification and keep repetition counts within ±1 of ground truth, while quality scores correlate strongly (r > 0.8) with human expert evaluations. Integration of such motion recognition modules into intelligent systems enables scalable, objective, and continuous assessment of physical education performance in primary and secondary schools22.
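The error-augmentation idea described above (slightly perturbing joint trajectories to simulate common mistakes) can be sketched in a few lines. The joint index, offset, and trajectory shape below are hypothetical placeholders, not the layout of any dataset discussed here:

```python
import numpy as np

def augment_errors(trajectory, joint_idx, offset, noise_std=0.0, seed=0):
    """Simulate a systematic movement error by shifting one joint's
    trajectory (e.g. an inward knee drift to mimic valgus) and
    optionally adding Gaussian jitter. `trajectory` is (T, J, D):
    T frames, J joints, D spatial dimensions."""
    rng = np.random.default_rng(seed)
    out = trajectory.copy()
    out[:, joint_idx, :] += np.asarray(offset)         # systematic deviation
    out += rng.normal(0.0, noise_std, size=out.shape)  # realistic jitter
    return out

# Toy example: 4 frames, 2 joints, 2D coordinates; shift joint 1 inward.
traj = np.zeros((4, 2, 2))
valgus = augment_errors(traj, joint_idx=1, offset=[-0.05, 0.0])
```

Pairing each clean sequence with several such perturbed variants gives the error-detection head negative examples without extra annotation effort.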
Human–computer interaction and feedback mechanisms
Effective integration of posture estimation and motion recognition into PE classroom environments depends critically on the design of human–computer interaction (HCI) interfaces and feedback systems that are intuitive, pedagogically effective, culturally appropriate, and suited to the age and developmental level of primary and secondary school students. Real-time feedback must be delivered in a way that facilitates motor learning without distracting from instruction or discouraging students. Visual feedback modalities include skeletal overlays on teacher monitors, encouraging students to mimic correct posture by showing real-time silhouettes of their own and ideal poses side by side23. Color-coded joint indicators highlight misalignment, turning red for joints outside of defined thresholds. Care is taken to calibrate color usage and ensure accessibility for users with color vision deficiencies. Audio feedback may involve synthesized voice prompts such as “straighten your back”, but these are carefully timed to avoid interrupting flow; instead, feedback is queued to appear between sets or reps24. Haptic or wearable feedback via vibrotactile bands has also been explored for tasks requiring proprioceptive correction, such as gait training or single-leg balance. These wearables vibrate when joint angles exceed permissible ranges. The design of feedback systems is informed by principles of motor learning theory, particularly the use of knowledge of results (KR) and knowledge of performance (KP). KP feedback offers information about the movement pattern itself, while KR provides summary outcomes such as completion time or repetition count. Adaptive feedback algorithms adjust the granularity and frequency of feedback over time, shifting from immediate error-corrective guidance for beginners to delayed summary feedback for more advanced students, thereby promoting independence and reflection25.
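The threshold-based, color-coded joint feedback described above reduces to a simple rule. The following sketch is hypothetical: the angle bounds and prompt strings are illustrative examples, not the configuration of any deployed system:

```python
def joint_feedback(angle_deg, lower, upper, messages):
    """Map a measured joint angle to a traffic-light status and a
    queued KP-style prompt: green inside [lower, upper], red outside."""
    if angle_deg < lower:
        return "red", messages["too_low"]
    if angle_deg > upper:
        return "red", messages["too_high"]
    return "green", None

# Hypothetical squat-depth check on a knee flexion angle.
msgs = {"too_low": "bend your knees more", "too_high": "avoid going too deep"}
```

In a real classroom system the returned prompt would be queued between repetitions rather than spoken immediately, in line with the timing considerations above.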
HCI systems also incorporate gamification elements: star ratings for performance, progress bars, achievement badges, and peer comparison leaderboards, all displayed on classroom dashboards. To preserve intrinsic motivation and avoid unhealthy competition, systems emphasize personal improvement over ranking, for instance by displaying daily improvement metrics and personalized targets. Privacy and ethical considerations in classrooms lead to the implementation of privacy-preserving designs: skeletal data is processed on-device or on a secure local server with anonymization before any data storage. User consent protocols and parental permissions are integrated in compliance with relevant educational and data protection legislation26. Evaluations of HCI frameworks use mixed-method approaches: quantitative analyses measure improvements in movement form fidelity, engagement time, and lesson completion rates, while qualitative feedback from teachers and students via surveys and interviews assesses usability, satisfaction, and perceived learning. Pilot studies in primary schools demonstrate that real-time feedback combined with gamified dashboards increases student engagement by 30–40% and improves correct form adherence by 25–35% over baseline. By combining cutting-edge perceptual models with pedagogically informed feedback design, intelligent evaluation systems become practical, effective, and ethically sound tools for enhancing physical education in primary and secondary school contexts27.
Method
Overview
The evaluation of physical education (PE) classrooms in schools plays a pivotal role in shaping the quality, equity, and impact of physical education as a discipline within the broader educational system. Unlike traditional academic subjects, the unique characteristics of PE—its emphasis on motor skills, physical literacy, engagement, and social development—pose distinct challenges for objective and reliable evaluation. This subsection outlines the structural components and methodological rationale of our approach to evaluating PE classrooms, providing a roadmap for the detailed developments in the following sections.
Section “Preliminaries” establishes the foundational problem setting, framing the core evaluation task using formal symbolic representations. We develop a representation of the PE evaluation process using mathematical abstractions that capture the complex interactions between student physical performance, teaching behavior, and environmental context. Key challenges include the temporal variability of class dynamics, the subjectivity of qualitative assessments, and the multidimensionality of physical competency. The section culminates in a symbolic expression of the PE evaluation task as a structured decision-making problem under uncertainty, laying the theoretical foundation for our subsequent innovations. Section “Kinematic-Narrative Evaluation Learner (KINEVAL)” introduces our proposed evaluation architecture, designated as KINEVAL (Kinematic-Narrative Evaluation Learner). This model is designed to integrate multimodal inputs—such as motion capture data, teacher verbal cues, and peer interaction metrics—within a unified representational framework. KINEVAL employs a novel hierarchical attention-based encoder that aligns temporal sequences of student activity with pedagogical rubrics and performance standards. By disentangling skill acquisition patterns from contextual noise, our model aims to provide both interpretable and actionable evaluation scores. Particular emphasis is placed on domain adaptation and robustness, ensuring that evaluations remain consistent across diverse school environments and instructional styles. In Section “Pedagogical Contextualization Strategy (PCS)”, we describe our strategic innovation, termed the Pedagogical Contextualization Strategy (PCS). PCS operationalizes how evaluation criteria are dynamically adjusted based on instructional intent, developmental appropriateness, and class-wide behavioral trends.
Whereas conventional assessments apply static metrics uniformly, our strategy conditions evaluative focus areas on the contextual embedding of each session—incorporating variables such as instructional phase, lesson objectives, and learner profiles. This strategy enhances fairness, supports formative feedback cycles, and mitigates evaluative bias. We incorporate an adaptive calibration mechanism that learns from teacher consensus over time, aligning algorithmic outputs with professional pedagogical judgment. Together, these components constitute an integrated methodology for PE classroom evaluation that addresses long-standing limitations in validity, scalability, and contextual sensitivity. By formalizing the evaluation task, designing a dedicated multimodal model, and embedding it within a flexible strategy framework, we provide a foundation for equitable and empirically grounded assessment in physical education.
Preliminaries
The evaluation of physical education (PE) classrooms involves capturing a variety of factors, including student motor performance, teacher instructional strategies, and environmental conditions. This process requires a multimodal, temporally structured representation that goes beyond conventional assessment models. In this section, we introduce the core symbolic components that form the foundation of our evaluation framework.
We represent a PE classroom session as a sequence of $T$ discrete time steps:

$$\mathcal{S} = (s_1, s_2, \ldots, s_T). \tag{1}$$

Each state $s_t$ contains student physical states, teacher intent, and environmental context:

$$s_t = \big(\{x_i^t\}_{i=1}^{N},\, p^t,\, c^t\big), \tag{2}$$

where $N$ is the number of students. Each student’s state vector $x_i^t$ consists of motion, posture, and engagement descriptors:

$$x_i^t = [\,m_i^t;\, u_i^t;\, g_i^t\,]. \tag{3}$$

The teacher’s instructional intent is represented by:

$$p^t = [\,\mu^t;\, o^t;\, \delta^t\,], \tag{4}$$

comprising the instructional modality $\mu^t$, the learning objective type $o^t$, and the instructional density $\delta^t$. Environmental context features are denoted by:

$$c^t = [\,\lambda^t;\, \epsilon^t;\, \rho^t\,], \tag{5}$$

covering the space layout $\lambda^t$, equipment availability $\epsilon^t$, and safety status $\rho^t$. Each student is evaluated across $L$ dimensions, forming an evaluation matrix $E \in \mathbb{R}^{N \times L}$:

$$E_{i,\ell} = \Phi_\ell\big(\{x_i^t\}_{t=1}^{T},\, \{p^t\}_{t=1}^{T},\, \{c^t\}_{t=1}^{T}\big), \tag{6}$$

where $\Phi_\ell$ denotes a latent nonlinear scoring function.

To capture performance progression over time, we extract a temporal embedding for each student:

$$z_i = \Psi\big(x_i^1, x_i^2, \ldots, x_i^T\big), \tag{7}$$

where $\Psi$ is a temporal encoding operator.

Session-level assessment further incorporates instructional effectiveness and diagnostic indicators:

$$\mathcal{A} = \big(\bar{E},\, Q,\, D\big), \qquad \bar{E} = \frac{1}{N}\sum_{i=1}^{N} E_{i,:}, \tag{8}$$

where $\bar{E}$ is the aggregated evaluation profile, $Q$ summarizes teacher effectiveness, and $D$ denotes session anomalies such as disengagement or safety risks.
This formulation describes PE evaluation as a temporally dynamic and context-aware inference process that integrates student behavior, instructional guidance, and environmental conditions. It provides the conceptual foundation for the KINEVAL model introduced in the following sections. A summary of all symbols is provided in Table 1 for reference.
Table 1.
Summary of Notation.
| Symbol | Definition | Dimension/Role |
|---|---|---|
| $\mathcal{S}$ | PE classroom session | Sequence of $T$ time steps |
| $s_t$ | Classroom state at time $t$ | Includes student, teacher, context |
| $x_i^t$ | Student $i$’s physical-motor vector | $\mathbb{R}^{d_x}$ |
| $m_i^t$ | Motion trajectory descriptor | Sub-vector of $x_i^t$ |
| $u_i^t$ | Postural stability indicator | Sub-vector of $x_i^t$ |
| $g_i^t$ | Engagement/affective features | Sub-vector of $x_i^t$ |
| $p^t$ | Teacher pedagogical intent | $\mathbb{R}^{d_p}$ |
| $\mu^t$ | Instructional modality | Part of $p^t$ |
| $o^t$ | Learning objective type | Part of $p^t$ |
| $\delta^t$ | Instructional density | Part of $p^t$ |
| $c^t$ | Environmental context vector | $\mathbb{R}^{d_c}$ |
| $\lambda^t$ | Space layout indicator | Part of $c^t$ |
| $\epsilon^t$ | Equipment availability | Part of $c^t$ |
| $\rho^t$ | Safety status feature | Part of $c^t$ |
| $E_{i,\ell}$ | Evaluation score for student $i$ on dimension $\ell$ | Scalar |
| $\Phi_\ell$ | Latent scoring function | Maps observations to scores |
| $z_i$ | Student $i$’s temporal embedding | $\mathbb{R}^{d_z}$ |
| $\Psi$ | Temporal encoder | Maps sequences to embeddings |
| $\bar{E}$ | Session-level average of $E$ | $\mathbb{R}^{L}$ |
| $Q$ | Teacher instructional effectiveness | Scalar/vector summary |
| $D$ | Diagnostic indicators | Disengagement, safety flags |
| $\mathcal{A}$ | Aggregated evaluation output | Tuple: $(\bar{E}, Q, D)$ |
To enhance interpretability for readers from interdisciplinary or non-technical backgrounds, we provide conceptual clarifications for the mathematical structures introduced above. The session sequence $\mathcal{S}$ represents a temporally ordered sequence of observations collected throughout a physical education class. Each state $s_t$ captures the complete classroom condition at time $t$, including student motion, teacher instructional behavior, and environmental factors. The student vector $x_i^t$ describes the physical and cognitive status of student $i$ at time $t$. This vector includes three primary components: motion trajectory descriptors ($m_i^t$), which encode patterns of movement across time; postural stability indices ($u_i^t$), which assess balance and alignment; and engagement-related signals ($g_i^t$), which reflect attentional focus or emotional responsiveness during activities. These elements together allow the model to evaluate both the quality and consistency of student performance. The teacher vector $p^t$ represents the instructional focus at time $t$, encompassing the mode of delivery such as physical demonstration or verbal explanation, the pedagogical objective such as developing motor skills or improving cardiovascular endurance, and the intensity of student-teacher interaction. The environmental vector $c^t$ includes spatial constraints, equipment availability, and safety conditions, which are critical for interpreting physical activity in real-world classroom contexts. The evaluation matrix $E$ is generated through a latent mapping from these observed variables to a set of structured evaluation dimensions, such as coordination ability, physical endurance, and collaborative behavior. The mapping function $\Phi$ incorporates domain knowledge, temporal attention mechanisms, and pedagogical consistency to ensure that each score reflects meaningful educational performance. Contextual modulation ensures that evaluation is fair and sensitive to variations in teaching conditions and classroom setups. This formulation models the evaluation of physical education as a structured, context-aware decision process. Each component in the framework plays a pedagogically grounded role in quantifying student learning and instructional effectiveness in a multidimensional and interpretable manner.
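To make the latent scoring map concrete, here is a deliberately simplified, hypothetical stand-in for the scoring function: it concatenates the three descriptor groups and projects them onto evaluation dimensions with hand-picked weights. The real scoring function is a learned, attention-based model, not a linear map; this sketch only illustrates the input/output structure:

```python
import numpy as np

def score_student(motion, posture, engagement, weights):
    """Toy stand-in for the latent scoring map: concatenate the three
    descriptor groups and project them onto L evaluation dimensions,
    squashing each score into (0, 1) with a sigmoid."""
    x = np.concatenate([motion, posture, engagement])
    logits = weights @ x  # weights has shape (L, len(x))
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: 2 evaluation dimensions over a 4-feature student vector,
# with hand-picked (illustrative) weights.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
scores = score_student(np.array([0.0, 0.0]), np.array([0.0]), np.array([0.0]), W)
```

A neutral (all-zero) input lands every dimension at the sigmoid midpoint, which makes the toy mapping easy to sanity-check before swapping in learned parameters.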
Kinematic-Narrative Evaluation Learner (KINEVAL)
To address the symbolic structure and operational constraints, we propose a novel model, KINEVAL (Kinematic-Narrative Evaluation Learner), which maps multi-agent, multimodal classroom observations to structured evaluative outcomes. The architecture is designed to learn both the spatial-temporal dynamics of student performance and the pedagogical narrative embedded in teacher actions, while accounting for contextual variability. Below, we outline three key innovations of KINEVAL that jointly enable robust, interpretable, and context-aware evaluation in physical education classrooms (As shown in Fig. 1).
Fig. 1.
Schematic diagram of the Kinematic-Narrative Evaluation Learner (KINEVAL). The model consists of three main components: Unified Kinematic Encoding, which uses a Bi-GRU and attention to extract temporal features $z_i$; Pedagogical Alignment Modeling, which computes the alignment between teacher intent $p^t$ and student motion using delay-aware cosine similarity and attention to obtain the pedagogical conformity vector $\kappa_i$; and Context-Aware Decoding, which processes contextual metadata $c^t$ to generate final evaluative embeddings used to predict the final score $\hat{E}_i$.
Unified Kinematic Encoding
KINEVAL introduces a unified kinematic encoder to comprehensively integrate multi-scale temporal motion sequences into rich trajectory representations that faithfully preserve the dynamics of student activities in physical education classrooms. For each student $i$, the multimodal physical-motor input vector at time $t$ is denoted as $x_i^t$, which captures joint positions, velocity signals, and auxiliary sensory cues from wearable or vision-based systems. To process such complex data, KINEVAL employs a hierarchical gated recurrent framework that fuses fine-grained micro-movements with long-term macro-behaviors. The temporal evolution of kinematics is first encoded through a multi-layered bidirectional GRU network with dilation levels $d \in \mathcal{D}$, generating intermediate states $h_t^{(d)}$ that retain temporal resolution across multiple scales. To ensure proper normalization of temporal responses, each hidden state is stabilized using layer-wise gating:

$$\tilde{h}_t^{(d)} = \sigma\big(U h_t^{(d)}\big) \odot \tanh\big(V h_t^{(d)}\big), \tag{9}$$

where $\sigma(\cdot)$ is the sigmoid gate, $U$ and $V$ are learnable parameters, and $\odot$ denotes element-wise multiplication. After this stabilization step, KINEVAL introduces a temporal-attention mechanism that adaptively highlights critical motion segments based on their relevance to evaluative outcomes. The attention coefficient $\alpha_t^{(d)}$ for each time step and dilation level is computed as:

$$\alpha_t^{(d)} = \frac{\exp\big(w^{\top} \tanh\big(W \tilde{h}_t^{(d)} + b_a\big)\big)}{\sum_{t'=1}^{T} \exp\big(w^{\top} \tanh\big(W \tilde{h}_{t'}^{(d)} + b_a\big)\big)}, \tag{10}$$

where $w$ and $W$ denote trainable attention weights and $b_a$ is an attention bias. These coefficients serve to modulate the temporal contribution of each frame, thereby allowing the model to focus on salient motion intervals such as rapid posture transitions or moments of instructional compliance. The weighted summation of attention-modulated features across all dilation levels forms the primary trajectory embedding $z_i$:

$$z_i = \sum_{d \in \mathcal{D}} \sum_{t=1}^{T} \alpha_t^{(d)}\, \tilde{h}_t^{(d)}, \tag{11}$$

capturing both local kinematic nuances and broader activity patterns in a compact vector. To enhance the expressiveness of this embedding, an additional temporal correlation matrix is computed to explicitly model inter-frame dependencies. Let $C \in \mathbb{R}^{T \times T}$ denote the self-similarity matrix whose entries measure the pairwise cosine similarity between encoded states:

$$C_{t,t'} = \frac{\big(\tilde{h}_t\big)^{\top} \tilde{h}_{t'}}{\lVert \tilde{h}_t \rVert \, \lVert \tilde{h}_{t'} \rVert}. \tag{12}$$

This temporal correlation information is then aggregated with $z_i$ via a residual fusion step to capture both pointwise dynamics and relational temporal structures. Through this unified design, the encoder is capable of disentangling subtle posture variations, periodic motion cycles, and irregular performance deviations in a scalable and computationally tractable manner, thereby providing a versatile foundation for downstream evaluative modeling within the KINEVAL framework.
To improve reproducibility and theoretical grounding, we provide detailed descriptions of the core architectural parameters used in both the bidirectional GRU module and the cross-modal Transformer layers within our KINEVAL framework. The hierarchical GRU structure consists of three bidirectional layers, each with a hidden state size of 256. These layers incorporate dilated temporal connections with dilation rates $\mathcal{D} = \{1, 2, 4\}$, allowing the model to capture short-term transitions and long-range dependencies in physical movement sequences. This design choice is inspired by prior work on dilated convolutions and dilated RNNs, which have demonstrated effectiveness in expanding the temporal receptive field without a proportional increase in parameters or computation. Specifically, dilated temporal structures enhance the model’s ability to interpret slow-motion transitions and repetitive action cycles, which are prevalent in physical education scenarios. In the cross-modal Transformer architecture, we adopt four self-attention heads per layer and a fixed local attention window of 64 frames, which balances global context aggregation and computational efficiency. The position encoding strategy employs learnable positional embeddings instead of fixed sinusoidal encodings, enabling the model to flexibly adapt to the irregular and non-periodic movement patterns typically observed in real-world classroom environments. This decision is guided by recent findings suggesting that learnable encodings outperform hand-crafted alternatives in heterogeneous temporal domains. Together, these parameters were selected not only based on empirical performance but also with strong theoretical justification from the motion recognition literature. Dilated architectures and multi-head attention mechanisms have consistently been shown to improve recognition accuracy in long-term motion modeling tasks. By explicitly reporting these core parameters and their underlying motivation, we aim to ensure full reproducibility and highlight the adaptability of our model to the unique movement characteristics in primary and secondary school PE classrooms.
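As a quick sanity check on the dilation design, the temporal receptive field of a dilated stack grows with the sum of the rates. The toy calculation below assumes the standard dilated connectivity in which each layer links a step to two inputs spaced $d$ apart; the exact KINEVAL wiring may differ, and the rates used here (1, 2, 4) are for illustration:

```python
def receptive_field(dilations, kernel=2):
    """Receptive field (in time steps) of a stack of dilated layers,
    each connecting an output step to `kernel` inputs spaced d apart,
    as in dilated RNNs/convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# Rates (1, 2, 4) see 1 + 1 + 2 + 4 = 8 steps; three undilated
# layers of the same depth see only 4.
wide = receptive_field((1, 2, 4))
plain = receptive_field((1, 1, 1))
```

This doubling of temporal coverage at equal depth and parameter count is the usual argument for exponentially spaced dilation rates in long-sequence motion modeling.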
Pedagogical Alignment Modeling
The second innovation captures and quantifies the alignment between teacher-led instructional trajectories and student kinematic behaviors. Unlike conventional evaluation approaches that treat teacher and student actions as independent streams, KINEVAL models their interaction as a coupled temporal process in which the pedagogical narrative directly influences the motor responses of students (as shown in Fig. 2). Let $u_t$ denote the teacher's instructional state at time $t$, which is first embedded into a latent pedagogical representation $h_t^{\mathrm{tea}}$ through a temporal transformer encoder $\mathcal{T}_\theta$. Formally, we compute:

$$h_t^{\mathrm{tea}} = \big[\mathcal{T}_\theta(u_{1:T})\big]_t \qquad (13)$$

where $u_{1:T}$ represents the full instructional sequence over $T$ steps. Simultaneously, the student's multimodal kinematic vector $x_t$ is encoded into a motion state $h_t^{\mathrm{stu}}$ through a stacked bidirectional gated recurrent operator, preserving both forward and backward temporal dependencies for fine-grained motion tracking. To explicitly model the degree of resonance between teacher guidance and student response, KINEVAL computes a time-varying alignment score based on a normalized cosine similarity:

$$a_t = \frac{\langle h_t^{\mathrm{stu}},\, h_t^{\mathrm{tea}} \rangle}{\lVert h_t^{\mathrm{stu}} \rVert \, \lVert h_t^{\mathrm{tea}} \rVert} \qquad (14)$$

This alignment signal captures whether the student's physical state matches the teacher's pedagogical intent at each timestep. However, momentary alignment may not fully reveal deeper instructional patterns; KINEVAL therefore aggregates temporal information using a gated alignment integrator that emphasizes critical instructional moments. A temporal attention weight $\alpha_t$ is learned to highlight informative steps:

$$\alpha_t = \frac{\exp\!\big(v^\top \tanh(U\,[h_t^{\mathrm{stu}};\, h_t^{\mathrm{tea}}])\big)}{\sum_{t'=1}^{T} \exp\!\big(v^\top \tanh(U\,[h_{t'}^{\mathrm{stu}};\, h_{t'}^{\mathrm{tea}}])\big)} \qquad (15)$$

where $v$ and $U$ are learnable parameters capturing non-linear pedagogical dependencies. The weighted sum of alignment scores across the entire session then forms a pedagogical conformity vector $c_i$, which reflects how consistently student $i$ aligns with instructional cues over time:

$$c_i = \sum_{t=1}^{T} \alpha_t \, a_t \qquad (16)$$

Moreover, to account for latent structural dependencies between consecutive teacher-student interactions, the model applies convolutional self-attention to the alignment sequence $\{a_t\}_{t=1}^{T}$, enabling it to identify repetitive instructional motifs and the corresponding student behavioral responses. This enhances the temporal expressiveness of the conformity vector, making it sensitive not only to immediate synchronization but also to delayed compliance and contextual nuances such as instructional pauses or emphasis points. By jointly modeling teacher intention, student kinematics, and their temporal interplay, this module enriches the evaluative representation with semantically meaningful alignment signals that reveal how students interpret and physically enact pedagogical instructions in diverse classroom contexts.
Fig. 2.

Schematic diagram of the Pedagogical Alignment Modeling. Student motion input $x_t$ is encoded by a Bi-GRU into $h_t^{\mathrm{stu}}$, while the teacher instruction sequence $u_{1:T}$ is encoded into $h_t^{\mathrm{tea}}$ via a transformer. Cosine similarity is computed between $h_t^{\mathrm{stu}}$ and $h_t^{\mathrm{tea}}$, with delay-aware alignment over a trailing window. The final conformity vector $c_i$ is computed by attention-weighted integration of the alignment scores.
To further improve the fidelity of our alignment modeling in real-world classroom scenarios, we introduce a temporal delay-aware formulation to accommodate response latency in student behavior. Drawing on educational interaction theory, it is well established that students often exhibit a time lag between receiving a teacher's instruction and physically enacting the corresponding movement, especially in complex or unfamiliar activities. Modeling alignment strictly at the same timestamp may therefore underestimate legitimate delayed imitation. To account for this, we redefine the alignment score at time $t$ for student $i$ as the maximum weighted similarity over a trailing window of instructional states:

$$a_t^{(i)} = \max_{0 \le \delta \le k} \; w_\delta \cdot \frac{\big\langle h_t^{\mathrm{stu},(i)},\, h_{t-\delta}^{\mathrm{tea}} \big\rangle}{\big\lVert h_t^{\mathrm{stu},(i)} \big\rVert \, \big\lVert h_{t-\delta}^{\mathrm{tea}} \big\rVert} \qquad (17)$$

where $k$ is a delay threshold (empirically set to 3 seconds) and $w_\delta$ is a temporal decay weight that emphasizes more recent instructions (e.g., $w_\delta = e^{-\lambda \delta}$). This structure enables the model to account for delayed student reactions, maintaining high alignment scores for pedagogically valid behavior even in the presence of short response lags. Empirical observations from physical education sessions confirm that such delays are common and pedagogically acceptable. The revised alignment formulation substantially enhances robustness in scenarios involving demonstration following, rhythm-based exercises, and step-by-step skill acquisition, and it correlates more closely with expert judgments of teacher-student synchrony.
Context-Aware Decoding
The third innovation introduces a context-modulated evaluation decoder that adaptively integrates heterogeneous classroom conditions into the final evaluative process, ensuring that environmental and situational factors are explicitly represented in the learning dynamics. Let $e_t$ denote the contextual metadata at time $t$, capturing diverse aspects such as class density, noise level, available resources, or spatial constraints of the physical environment. These raw contextual signals are first projected into a latent space via a multilayer perceptron $f_\phi$, producing a context-aware modulation vector $m_t$:

$$m_t = \sigma\big(W_m e_t + b_m\big) \qquad (18)$$

where $W_m$ and $b_m$ are trainable parameters and $\sigma$ is a nonlinear activation. To capture the temporal evolution of context across the session, a recurrent gating function is applied, enabling the model to emphasize salient contextual shifts while suppressing redundant information. We introduce a gated accumulation $g_t$ computed as

$$g_t = z_t \odot m_t + (1 - z_t) \odot g_{t-1} \qquad (19)$$

where $z_t$ is a learnable gate that dynamically determines how much new context should be integrated relative to the historical state. The temporally smoothed contextual sequence $\{g_t\}_{t=1}^{T}$ is then aggregated through an attention-weighted pooling mechanism to form a session-level contextual summary vector $\bar{g}$:

$$\bar{g} = \sum_{t=1}^{T} \beta_t \, g_t, \qquad \beta_t = \frac{\exp\!\big(u^\top \tanh(V g_t)\big)}{\sum_{t'=1}^{T} \exp\!\big(u^\top \tanh(V g_{t'})\big)} \qquad (20)$$

where $u$ and $V$ are learnable parameters determining the temporal relevance of each contextual state. This session-level context encoding acts as an adaptive conditioning signal that modulates the interpretation of both student motion features and pedagogical alignment. To produce the evaluation embedding for student $i$, KINEVAL fuses the kinematic representation $h_i$, the pedagogical conformity vector $c_i$, and the aggregated context vector $\bar{g}$ through a nonlinear transformation:

$$r_i = \phi\big(W_r \, [\, h_i \,\Vert\, c_i \,\Vert\, \bar{g} \,] + b_r\big) \qquad (21)$$

where $\Vert$ denotes concatenation, $\phi$ is an activation function such as GELU, and $W_r$, $b_r$ are trainable parameters. This integrated embedding serves as a comprehensive representation that jointly encodes individual student performance, teacher-student interaction quality, and the dynamic classroom context, thereby enabling the decoder to generate evaluation outputs that remain sensitive to both micro-level behaviors and macro-level environmental variations within the session.
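A minimal NumPy sketch of this fusion step: concatenate the kinematic, conformity, and context vectors, then apply a GELU-activated affine map. The dimensions, random weights, and the tanh-based GELU approximation are illustrative assumptions, not the trained model's values.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def fuse(h_i, c_i, ctx, W, b):
    """Eq.-(21)-style fusion: concatenation followed by a nonlinear
    affine transformation into the evaluation embedding."""
    z = np.concatenate([h_i, c_i, ctx])
    return gelu(W @ z + b)

# Hypothetical sizes: 8-d kinematic state, 4-d conformity, 4-d context.
h_i, c_i, ctx = rng.standard_normal(8), rng.standard_normal(4), rng.standard_normal(4)
W, b = rng.standard_normal((16, 16)) * 0.1, np.zeros(16)
e_i = fuse(h_i, c_i, ctx, W, b)
```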
Pedagogical Contextualization Strategy (PCS)
While the KINEVAL architecture provides a rigorous mechanism for extracting structured evaluation representations from multimodal classroom data, the ultimate effectiveness of an evaluation framework hinges on its ability to adapt those representations to situational pedagogical contexts. In this section, we define the Pedagogical Contextualization Strategy (PCS) through three key innovations that keep evaluative interpretations fair, aligned with instructional phases, and sensitive to localized behavioral patterns (as shown in Fig. 3).
Fig. 3.
Schematic diagram of the Pedagogical Contextualization Strategy (PCS). The PCS architecture is an integrated tri-module pipeline comprising Context-Aware Scaling for dynamic phase-aligned score normalization, Entropy-Guided Variance to adapt evaluation stability under instructional uncertainty, and Expert-Prior Calibration for aligning model-driven assessments with domain-informed pedagogical standards. The architecture fuses amplitude and phase features via an Interaction Alignment Module (IAM), enabling joint reasoning across semantic and behavioral contexts. Expert-Prior Calibration further incorporates global averaging and element-wise interactions to refine evaluative outputs in accordance with instructional priors.
Context-Aware Scaling
To ensure robust alignment with diverse instructional phases, the Pedagogical Contextualization Strategy (PCS) enhances evaluative fairness by dynamically scaling raw student scores using a multi-level semantic embedding that captures both global curriculum intent and local classroom states. Let the raw evaluative score matrix be $R \in \mathbb{R}^{n \times L}$ for $n$ students across $L$ evaluative dimensions, and let $S$ represent the contextual embedding derived from curricular metadata $M$, temporal teacher intent signals $I_{1:T}$, and aggregated engagement indicators $G$. We first define a semantic projection operator that enriches $S$ with historical memory traces from prior sessions:

$$\tilde{S} = \psi\big(W_s S + W_h \mathcal{M}\big) \qquad (22)$$

where $\mathcal{M}$ denotes a memory tensor encoding longitudinal evaluation distributions. The relevance of each evaluative dimension $\ell$ is determined through a nonlinear relevance gate, yielding a dynamic scaling factor $\gamma_\ell$ that adapts to the pedagogical context:

$$\gamma_\ell = \sigma\Big(w_\ell^\top \tilde{S} + \sum_{\ell' \in \mathcal{N}(\ell)} \omega_{\ell\ell'} \, w_{\ell'}^\top \tilde{S}\Big) \qquad (23)$$

where $\mathcal{N}(\ell)$ denotes a set of semantically related dimensions forming a local pedagogical neighborhood. To counteract transient fluctuations, PCS interpolates between individual student scores and the class-wise stable anchor $\bar{r}_\ell$, forming the contextualized score for student $i$ on dimension $\ell$:

$$\hat{r}_{i\ell} = \gamma_\ell \big(\rho \, r_{i\ell} + (1 - \rho)\, \bar{r}_\ell\big) + \tau \, \Delta_{i\ell} \qquad (24)$$

where $\Delta_{i\ell}$ denotes a context-sensitive residual capturing deviation from instructional norms, modulated by a scaling hyperparameter $\tau$, and $\rho$ is the interpolation weight. PCS further introduces an entropy-compensated normalization to ensure that scaling remains sensitive to instructional uncertainty. The instructional entropy $H_t$ at time $t$ is computed from softmax-normalized teacher focus distributions $p_{t,j}$ over $J$ instructional intents, and its session-wide average $H$ serves as a regulatory term:

$$\tilde{\gamma}_\ell = \frac{\gamma_\ell}{1 + \eta H} \qquad (25)$$

where $\eta$ adjusts sensitivity to entropy-driven scaling. Through this multi-stage formulation, Context-Aware Scaling preserves instructional intent while dynamically stabilizing evaluations under varying pedagogical contexts, ensuring that each dimension's contribution is both phase-aware and statistically grounded in broader instructional dynamics.
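The anchoring-and-scaling logic above can be sketched compactly. The interpolation weight, residual scale, entropy form, and all default values below are illustrative assumptions for this sketch rather than the framework's calibrated settings.

```python
import numpy as np

def contextualize(r, gamma, rho=0.7, tau=0.1, eta=0.5, H=0.0, resid=None):
    """Sketch of contextualized scoring: interpolate each student's raw
    score with the class-wise anchor (the column mean), scale by the
    per-dimension factor gamma, temper the scaling by session entropy H,
    and add an optional residual term."""
    anchor = r.mean(axis=0, keepdims=True)      # class-wise stable anchor
    resid = np.zeros_like(r) if resid is None else resid
    g = gamma / (1.0 + eta * H)                 # entropy-compensated scaling
    return g * (rho * r + (1 - rho) * anchor) + tau * resid
```

With unit scaling, full weight on the individual score, and zero entropy and residual, the transformation reduces to the identity, which makes the role of each term easy to isolate.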
From the perspective of educational evaluation theory, it is important to recognize that course metadata and teacher intention may exert differential influence on evaluative outcomes. In our model, both inputs are integrated via a shared semantic projection operator to form a unified embedding $S$. However, their conceptual roles differ: course metadata encodes long-term curricular structure and intended learning outcomes, while teacher intention reflects real-time instructional focus and pedagogical emphasis. To ensure that this asymmetry is preserved and interpreted meaningfully, we analyze the individual contributions of each source to the generation of the dimension-specific scaling factor $\gamma_\ell$. We propose a gradient-based attribution analysis to quantify the relative impact of each information stream. For each evaluation dimension $\ell$, we compute the norms of the partial derivatives of $\gamma_\ell$ with respect to the encoded course metadata $M$ and the teacher intention sequence $I_{1:T}$:

$$\left\lVert \frac{\partial \gamma_\ell}{\partial M} \right\rVert_2, \qquad \left\lVert \frac{\partial \gamma_\ell}{\partial I_{1:T}} \right\rVert_2 \qquad (26)$$

These gradient norms indicate the sensitivity of the scaling factor to each input type. Empirical evaluation reveals that technical and standard-aligned dimensions (such as posture and coordination) exhibit higher dependency on course metadata, while affective or behavioral traits (such as engagement and responsiveness) are more influenced by teacher intention. This finding aligns with pedagogical theory, which posits that evaluation priorities adapt dynamically with lesson objectives. By simulating sessions under varied pedagogical goals (such as skill development vs. fitness training), we observe corresponding shifts in the scaling vector $\gamma = (\gamma_1, \ldots, \gamma_L)$ that redistribute evaluative focus accordingly. For instance, in sessions emphasizing endurance, PCS assigns more weight to physical output dimensions; conversely, in motor skill-focused contexts, cognitive and technical metrics are prioritized. These results suggest that our semantic fusion mechanism not only integrates information effectively but also aligns with educational principles of adaptive evaluation under goal-sensitive contexts.
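Since the attribution in Eq. (26) relies only on gradient norms, it can be emulated with central finite differences for any differentiable scaling function. The helper `attribution_norms` below is a hypothetical stand-in for illustration, not part of the released codebase.

```python
import numpy as np

def attribution_norms(gamma_fn, M, I, eps=1e-5):
    """Finite-difference analogue of gradient-based attribution: norm of
    the sensitivity of the scaling factor to course metadata M versus the
    teacher intention vector I."""
    def grad(f, x):
        g = np.zeros_like(x)
        for j in range(x.size):
            d = np.zeros_like(x)
            d.flat[j] = eps
            g.flat[j] = (f(x + d) - f(x - d)) / (2 * eps)
        return g
    gM = grad(lambda m: gamma_fn(m, I), M)
    gI = grad(lambda i: gamma_fn(M, i), I)
    return np.linalg.norm(gM), np.linalg.norm(gI)
```

For a toy scaling function that weights metadata four times as strongly as intention, the first norm comes out correspondingly larger, which is exactly the kind of asymmetry the analysis is meant to surface.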
Entropy-Guided Variance
PCS compensates for instructional uncertainty by modulating evaluation variance using the entropy of pedagogical action distributions, thereby introducing a robust mechanism to handle fluctuations caused by varying teaching strategies, spontaneous interactions, and dynamic classroom engagement (as shown in Fig. 4). Let $p_{t,j}$ denote the probability of selecting a specific pedagogical action $j$ at time step $t$, obtained through a softmax-normalized teacher intent vector. The instantaneous entropy $H_t$ of the instructional state at time $t$ is defined as

$$H_t = -\sum_{j=1}^{J} p_{t,j} \log p_{t,j} \qquad (27)$$

The average instructional entropy across a full session of $T$ time steps can then be expressed as

$$H = \frac{1}{T} \sum_{t=1}^{T} H_t \qquad (28)$$

capturing the overall diffusion of instructional focus during the learning process. To incorporate this contextual uncertainty, the variance $\sigma_\ell^2$ of the context-adjusted score distribution along each evaluative dimension $\ell$ is scaled by an entropy-driven factor, yielding

$$\tilde{\sigma}_\ell^2 = \sigma_\ell^2 \big(1 + \eta_1 H + \eta_2 H^2\big) \qquad (29)$$

where $\eta_1$ controls linear sensitivity to entropy and $\eta_2$ introduces a second-order amplification effect for highly unstable instructional phases. The adjusted student-level scores are further stabilized through an entropy-weighted smoothing operation, ensuring that evaluations in high-uncertainty contexts remain balanced:

$$\tilde{r}_{i\ell} = (1 - \kappa H)\, \hat{r}_{i\ell} + \kappa H \, \bar{r}_\ell \qquad (30)$$

where $\bar{r}_\ell$ represents the global mean score on dimension $\ell$ and $\kappa$ regulates the smoothing strength as a function of entropy. By jointly leveraging instantaneous entropy, session-wide entropy, and entropy-amplified variance scaling, this formulation enhances tolerance for deviations under diffuse instructional focus while retaining sensitivity to stable pedagogical patterns in low-entropy conditions.
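The entropy computations of Eqs. (27)-(29) reduce to a few lines of NumPy. The helper names and the default sensitivity values are illustrative for this sketch.

```python
import numpy as np

def session_entropy(intent_logits):
    """Eqs. (27)-(28): softmax the per-step teacher intent vectors, take
    the Shannon entropy at each step, and average over the session."""
    z = intent_logits - intent_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    H_t = -(p * np.log(p + 1e-12)).sum(axis=1)
    return H_t.mean()

def entropy_scaled_variance(var, H, eta1=0.3, eta2=0.1):
    """Eq.-(29)-style linear plus quadratic entropy amplification of the
    per-dimension score variance (eta1, eta2 are illustrative values)."""
    return var * (1 + eta1 * H + eta2 * H ** 2)
```

A session whose teacher focus is uniform over J intents yields the maximal entropy log J, so variance tolerance is widest exactly when instructional focus is most diffuse.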
Fig. 4.

Schematic diagram of the Entropy-Guided Variance. The model employs a series of dynamic convolutional components (D.C.) including BAM, BAN, and a central interaction module to regulate evaluative uncertainty arising from instructional entropy. Initial convolutional processing extracts temporal teaching features, which are refined through sequential attention modules and fed into a final convolutional layer. The entropy of pedagogical action distributions—computed from a softmax over intent vectors—governs both the scaling of evaluation variance and the entropy-weighted smoothing of student-level scores. By incorporating linear and quadratic entropy effects, the system dynamically adapts to fluctuating instructional strategies, ensuring that evaluative metrics remain robust across both stable and chaotic learning phases.
Expert-Prior Calibration
To align outputs with pedagogical norms, PCS employs a multi-step expert-prior calibration that integrates model-generated evaluations with domain-informed distributions, ensuring that scores are neither purely data-driven nor detached from established instructional reasoning. This process begins with the construction of an empirical expert distribution $\pi_\ell(z)$ for each evaluative dimension $\ell$, derived from teacher-annotated exemplars and validated classroom cases. Let the raw contextualized score $\hat{r}_{i\ell}$ for student $i$ on dimension $\ell$ be a noisy observation in a latent pedagogical space. We introduce a probabilistic refinement step in which the calibrated score $r_{i\ell}^{\ast}$ is treated as the posterior mean of a Bayesian update. First, we define a Gaussian likelihood model for the observed contextualized score:

$$p\big(\hat{r}_{i\ell} \mid z\big) = \mathcal{N}\big(\hat{r}_{i\ell};\, z,\, \sigma_\ell^2\big) \qquad (31)$$

where $\sigma_\ell^2$ denotes the variance associated with model confidence on dimension $\ell$. Next, the expert prior $\pi_\ell(z)$ is represented as a categorical or continuous distribution over plausible pedagogical outcomes, which we approximate as a mixture of Gaussians to preserve multimodal expert opinions. The posterior over the refined score $z$ is then given by Bayes' theorem:

$$p\big(z \mid \hat{r}_{i\ell}\big) = \frac{p\big(\hat{r}_{i\ell} \mid z\big)\, \pi_\ell(z)}{\int p\big(\hat{r}_{i\ell} \mid z'\big)\, \pi_\ell(z')\, \mathrm{d}z'} \qquad (32)$$

which forms the foundation for computing the calibrated evaluation. To ensure that the final calibration does not overfit transient model predictions, we impose a regularized optimization that interpolates between the raw contextualized score and the expert prior by minimizing a combined energy functional:

$$E(z) = \lambda \big(z - \hat{r}_{i\ell}\big)^2 + \mu \big(z - m_\ell\big)^2 \qquad (33)$$

where $m_\ell$ is the mean of the expert prior, $\lambda$ controls the trust in model predictions versus expert knowledge, and $\mu$ governs the strength of prior alignment. The closed-form calibrated score is obtained by solving for the minimizer of $E(z)$:

$$r_{i\ell}^{\ast} = \frac{\lambda \, \hat{r}_{i\ell} + \mu \, m_\ell}{\lambda + \mu} \qquad (34)$$

which balances empirical evidence, contextual relevance, and expert-informed constraints. Through this mechanism, PCS refines individual scores by incorporating structured pedagogical priors, resolving inconsistencies that arise from purely data-driven models, and maintaining coherence with normative educational standards while adapting dynamically to contextual uncertainty and heterogeneous expert judgments.
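The closed-form update in Eq. (34) is a precision-weighted average of the model score and the expert-prior mean. The weights below (0.7 and 0.3) are illustrative, not the calibrated values.

```python
def calibrate(score, prior_mean, lam=0.7, mu=0.3):
    """Eq.-(34)-style closed form: the minimizer of a quadratic energy
    trading off the contextualized score against the expert-prior mean."""
    return (lam * score + mu * prior_mean) / (lam + mu)

# A score of 0.9 with an expert prior centred at 0.6 is pulled toward the prior:
print(calibrate(0.9, 0.6))  # ~0.81
```

Setting mu to zero recovers the raw model score, while letting mu dominate collapses the output onto the expert prior, which is the trust trade-off the text describes.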
Experimental setup
Dataset
The Primary School Student Posture Dataset28 consists of annotated images and motion sequences capturing primary school students in various learning postures during activities. The dataset is collected using depth sensors and RGB cameras placed in typical classroom settings. Each sample is labeled with posture categories such as sitting upright, leaning, slouching, and hand-raising. The data emphasizes pose variation, occlusion scenarios, and classroom-specific interaction cues to support supervised learning models in education-oriented pose classification tasks. This dataset provides a foundational benchmark for evaluating posture detection algorithms tailored for young children in real-world school settings. The Secondary School Motion Recognition Dataset29 includes continuous action recordings of secondary school students performing predefined educational and physical actions during daily school routines. Data modalities include IMU (Inertial Measurement Unit) signals, video frames, and skeletal data captured through multi-sensor fusion technologies. Actions such as walking, running, turning, stretching, and interacting with educational equipment are richly annotated with temporal boundaries and participant metadata. The dataset supports sequential action recognition, student engagement analysis, and fine-grained movement understanding in everyday school environments. Its diversity in physical size and motion pace reflects realistic variations across age groups. The Physical Education Classroom Activity Dataset30 contains multi-angle video and sensor-based recordings of students participating in structured physical education (PE) activities under teacher supervision. Captured in indoor and outdoor PE environments, the dataset includes label categories for movement types, activity duration, group interactions, and instructor interventions. The dataset emphasizes multi-person action tracking, coordination analysis, and energy expenditure estimation.
It is particularly suited for evaluating real-time activity segmentation and action-quality prediction models for school-based exercise scenarios, offering robust support for physical activity assessment systems. The Intelligent Evaluation Model Training Dataset31 is designed to support intelligent assessment systems by providing labeled sequences that correspond to both correct and incorrect behaviors during classroom and physical training sessions. It integrates pose estimations, action labels, correctness scores, and expert-annotated feedback collected from teachers and domain experts. The data is structured to train machine learning models that replicate or augment human evaluation in educational settings. It includes edge-case behaviors, confusion actions, and blended motions, making it valuable for generalization testing. This dataset serves as the backbone for intelligent behavior scoring, student performance evaluation, and automated instructional feedback generation.
Experimental details
The implementation is based on PyTorch 2.0, and training is performed using CUDA 11.7. We adopt a batch size of 32 for all datasets and set the initial learning rate to 0.001, which is decayed with a cosine annealing scheduler. The Adam optimizer is employed with its standard momentum coefficients $\beta_1$ and $\beta_2$ together with a small weight decay for regularization. Training is conducted for 100 epochs unless otherwise specified. Gradient clipping with a maximum norm of 1.0 is applied to stabilize training. All input data is normalized to zero mean and unit variance. For the visual modality, video sequences are resized to 224×224 resolution and sampled at 30 fps. For sensor-based inputs (IMU or skeleton), we apply min-max normalization and temporal interpolation to align sequence lengths. Positional encoding is applied to capture temporal dynamics, especially for Transformer-based backbones. Dropout with a probability of 0.3 is used to mitigate overfitting. For the model backbone, we utilize a dual-stream architecture: one stream processes visual data via a ResNet-50 pretrained on ImageNet, and the other handles structured motion data through a temporal convolutional network (TCN). Feature fusion is conducted via a gated attention module, which adaptively weights cross-modal features based on context relevance. A multi-layer perceptron (MLP) follows the fusion layer and outputs class predictions through softmax activation. Evaluation metrics include Top-1 and Top-5 accuracy for classification tasks, F1-score for imbalanced class scenarios, and mean Average Precision (mAP) for multi-label cases. For regression-based behavior scoring, we report Mean Squared Error (MSE) and the Pearson correlation coefficient. All results are averaged over three runs with different random seeds to ensure reproducibility. We follow best practices from top-tier conferences such as CVPR, AAAI, and ECCV to ensure scientific rigor and reproducibility. Hyperparameter tuning is performed using grid search on a 10% held-out validation set. Early stopping is used with a patience of 10 epochs based on validation loss. For cross-dataset generalization experiments, models trained on one dataset are evaluated on another without fine-tuning to assess robustness. Implementation follows FAIR's code style guide and is verified for correctness through unit testing. Code and preprocessed data will be released upon publication to promote transparency and reproducibility.
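For concreteness, the cosine annealing schedule described above (decaying the initial learning rate of 0.001 toward zero over the run) can be written as a standalone function. `cosine_lr` is an illustrative helper equivalent in shape to PyTorch's built-in scheduler, not the exact call used in training.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Cosine annealing: start at base_lr, decay smoothly to min_lr."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

The schedule starts exactly at the configured 0.001 and reaches the floor at the final step, with most of the decay concentrated in the middle of training.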
To enhance the transparency and reproducibility of our experimental design, we provide detailed information about dataset partitioning and structure. For all four datasets used in this study—the Primary School Student Posture Dataset, Secondary School Motion Recognition Dataset, Physical Education Classroom Activity Dataset, and Intelligent Evaluation Model Training Dataset—we uniformly adopt a stratified splitting strategy to divide data into training, validation, and test sets. Specifically, each dataset is split using an 80-10-10 ratio, where 80% of the samples are used for training, 10% for validation, and 10% for testing. This stratification ensures that class distributions and behavioral variations are preserved across all subsets. To further validate the robustness of our results, we conduct 3-fold cross-validation on the training and validation sets for each dataset. The final performance metrics reported are averaged across three independent runs, each using a different random seed for splitting and initialization. This methodology helps mitigate overfitting and provides a more comprehensive assessment of model generalizability. In terms of dataset composition, the Primary School Student Posture Dataset includes approximately 12,000 labeled frames representing classroom postures such as upright sitting, slouching, leaning, and hand-raising, captured using RGB and depth cameras in real school settings. The Secondary School Motion Recognition Dataset contains over 9,500 continuous action clips recorded via IMUs and skeletal tracking, covering scenarios like walking, running, turning, stretching, and interacting with classroom objects. The Physical Education Classroom Activity Dataset consists of 8,300 multi-angle video clips and sensor readings collected during structured PE sessions, emphasizing multi-person interaction, coordination, and time-based segmentation. 
Lastly, the Intelligent Evaluation Model Training Dataset contains around 15,000 labeled samples enriched with expert feedback, correctness annotations, and ambiguous edge cases, designed specifically for learning evaluative patterns. These details are crucial for replicating the experimental pipeline and verifying the statistical integrity of our results. All datasets used are publicly available or will be released upon publication to further promote transparency and reuse in the research community.
Comparison with SOTA methods
The proposed method is benchmarked against six state-of-the-art video analysis models— I3D32, SlowFast33, TSN34, VideoMAE35, CoViAR36, and TimeSformer37—across four datasets: Primary School Student Posture Dataset, Secondary School Motion Recognition Dataset, Physical Education Classroom Activity Dataset, and Intelligent Evaluation Model Training Dataset. Results are summarized in Tables 2 and 3, with visual comparisons in Figs. 5 and 6. From the tables, it is evident that our method achieves the highest performance on all datasets and across all metrics. For instance, in the Primary School Student Posture Dataset, our method improves Accuracy by +2.87% over TimeSformer and +3.98% over VideoMAE. In terms of F1 Score, which balances precision and recall, our model achieves 92.03 compared to 88.95 by TimeSformer. In the Secondary School Motion Recognition Dataset, the gains are even more pronounced, with our model achieving 95.34% Accuracy and 92.45 F1 Score, significantly outperforming other approaches. These improvements demonstrate the model’s superior ability to recognize fine-grained, posture-sensitive, and temporally-evolving classroom behaviors, likely due to its integrated attention fusion and multi-modal representation, as explained in Section "Kinematic-Narrative Evaluation Learner (KINEVAL)". Notably, VideoMAE and TimeSformer also deliver competitive results, which confirms the utility of Transformer-based temporal modeling, but they lack the domain-specific tuning incorporated in our approach.
Table 2.
Benchmark results of different models on primary school posture recognition and secondary school motion analysis, highlighting improvements achieved by the proposed approach across multiple evaluation metrics.
| Model | Primary school student posture dataset | Secondary school motion recognition dataset | ||||||
|---|---|---|---|---|---|---|---|---|
| Accuracy | Recall | F1 Score | AUC | Accuracy | Recall | F1 Score | AUC | |
| I3D32 | 89.12±0.03 | 85.47±0.02 | 86.03±0.02 | 88.15±0.03 | 87.61±0.02 | 84.92±0.02 | 85.31±0.01 | 86.75±0.03 |
| SlowFast33 | 90.25±0.02 | 86.10±0.03 | 87.92±0.02 | 89.34±0.02 | 88.33±0.03 | 86.01±0.02 | 85.94±0.02 | 87.89±0.02 |
| TSN34 | 87.86±0.03 | 84.03±0.02 | 84.92±0.02 | 85.26±0.03 | 86.12±0.02 | 83.67±0.03 | 83.40±0.02 | 84.91±0.02 |
| VideoMAE35 | 91.00±0.02 | 88.45±0.02 | 89.18±0.02 | 90.72±0.02 | 90.02±0.03 | 86.94±0.02 | 87.77±0.02 | 88.50±0.03 |
| CoViAR36 | 88.97±0.03 | 85.32±0.03 | 86.01±0.02 | 87.10±0.02 | 85.95±0.02 | 82.85±0.02 | 84.00±0.03 | 85.41±0.02 |
| TimeSformer37 | 92.11±0.02 | 89.20±0.02 | 88.95±0.03 | 91.10±0.02 | 89.87±0.03 | 87.31±0.02 | 87.99±0.02 | 89.00±0.02 |
| Ours | 94.98±0.02 | 91.66±0.02 | 92.03±0.02 | 94.25±0.02 | 95.34±0.02 | 92.87±0.02 | 92.45±0.02 | 94.90±0.02 |
Table 3.
Evaluation of competing models on physical education classroom activities and intelligent assessment training, demonstrating the effectiveness of the proposed framework under diverse scenarios.
| Model | Physical education classroom activity dataset | Intelligent evaluation model training dataset | ||||||
|---|---|---|---|---|---|---|---|---|
| Accuracy | Recall | F1 Score | AUC | Accuracy | Recall | F1 Score | AUC | |
| I3D32 | 88.43±0.02 | 85.76±0.03 | 84.95±0.02 | 86.80±0.02 | 86.59±0.02 | 83.24±0.02 | 84.13±0.02 | 85.67±0.03 |
| SlowFast33 | 90.51±0.03 | 87.82±0.02 | 88.11±0.02 | 89.77±0.03 | 88.92±0.02 | 86.45±0.03 | 86.27±0.02 | 88.10±0.02 |
| TSN34 | 86.97±0.02 | 83.40±0.02 | 84.22±0.03 | 85.20±0.02 | 85.03±0.03 | 82.70±0.02 | 83.18±0.02 | 84.34±0.02 |
| VideoMAE35 | 91.15±0.02 | 88.90±0.02 | 88.35±0.02 | 90.14±0.03 | 90.67±0.02 | 87.83±0.03 | 88.66±0.02 | 89.91±0.02 |
| CoViAR36 | 87.48±0.03 | 84.31±0.03 | 85.00±0.02 | 86.09±0.02 | 86.35±0.02 | 83.74±0.02 | 84.12±0.03 | 85.78±0.03 |
| TimeSformer37 | 92.37±0.02 | 89.12±0.03 | 89.46±0.02 | 91.03±0.02 | 91.45±0.03 | 88.50±0.02 | 89.03±0.02 | 90.72±0.02 |
| Ours | 95.02±0.02 | 92.67±0.02 | 91.90±0.02 | 94.10±0.02 | 94.76±0.02 | 92.81±0.02 | 91.42±0.02 | 94.85±0.02 |
Fig. 5.

Distribution of contributions by different methods across two student activity recognition datasets, illustrating the relative impact of each approach.
Fig. 6.
Comprehensive evaluation of multiple models on two distinct datasets from physical education classrooms and intelligent assessment training, highlighting their effectiveness across diverse scenarios and metrics.
In the Physical Education Classroom Activity Dataset, our method again demonstrates clear superiority. We achieve 95.02% Accuracy and 91.90 F1 Score, improving over TimeSformer by approximately +2.65% in Accuracy. Given the multi-person and dynamic nature of PE activities, these results indicate that our model handles spatiotemporal interaction and motion diversity more effectively. This advantage can be attributed to our method’s ability to simultaneously model both high-level semantics and fine-grained motion patterns through the dual-stream architecture and contextual fusion module. Models like TSN and CoViAR show relatively lower performance, likely due to their reliance on either sparse frame sampling or fixed compressed domain representations, which are insufficient in capturing the continuity and complexity of physical actions in group scenarios. Similarly, for the Intelligent Evaluation Model Training Dataset, our method achieves 94.76 Accuracy and 91.42 F1 Score, outperforming the closest competitor by over 3 points in both metrics. This dataset poses a unique challenge as it involves learning from partially correct and pedagogically meaningful student actions. The superior performance suggests that our model generalizes well under soft supervision and handles ambiguous labels effectively, likely due to the robustness of our feature gating and multi-modal alignment strategies.
These consistent gains highlight KINEVAL’s core strengths: attention-based fusion for context-sensitive integration of visual and motion cues, adaptability across diverse educational settings, and structured correctness supervision for reasoning about action quality. These design choices enhance generalization, robustness, and interpretability. Low variance across trials further confirms training stability. Both quantitative and qualitative results validate the model’s effectiveness for intelligent classroom and activity analysis.
Ablation study
Extensive ablation studies are conducted on four datasets to evaluate the individual contributions of key components within the proposed architecture. We evaluate the performance of the full model against three ablated variants: without the Context-Aware Decoding, without Context-Aware Scaling, and without Expert-Prior Calibration. The results are reported in Tables 4 and 5, covering the Primary School Student Posture, Secondary School Motion Recognition, Physical Education Classroom Activity, and Intelligent Evaluation Model Training Datasets. These comparisons allow us to quantify the individual impact of each component on final performance across four metrics.
Table 4.
Impact of individual module removal on primary school posture recognition and secondary school motion analysis datasets, highlighting the contribution of each component.
| Model | Primary school student posture dataset |  |  |  | Secondary school motion recognition dataset |  |  |  |
|---|---|---|---|---|---|---|---|---|
|  | Accuracy | Recall | F1 Score | AUC | Accuracy | Recall | F1 Score | AUC |
| w/o Context-Aware Decoding | 92.16±0.02 | 88.74±0.02 | 90.02±0.02 | 91.37±0.02 | 93.11±0.03 | 90.80±0.02 | 89.20±0.02 | 91.50±0.02 |
| w/o Context-Aware Scaling | 93.78±0.03 | 89.90±0.02 | 91.27±0.02 | 93.02±0.03 | 94.02±0.02 | 91.15±0.03 | 90.87±0.02 | 93.12±0.02 |
| w/o Expert-Prior Calibration | 91.82±0.03 | 90.21±0.02 | 88.13±0.02 | 90.91±0.02 | 92.27±0.02 | 89.76±0.02 | 89.33±0.03 | 91.40±0.03 |
| Ours | 94.98±0.02 | 91.66±0.02 | 92.03±0.02 | 94.25±0.02 | 95.34±0.02 | 92.87±0.02 | 92.45±0.02 | 94.90±0.02 |
Table 5.
Impact of individual module removal on physical education classroom activity and intelligent evaluation training datasets, showing the effect of different components on overall outcomes.
| Model | Physical education classroom activity dataset |  |  |  | Intelligent evaluation model training dataset |  |  |  |
|---|---|---|---|---|---|---|---|---|
|  | Accuracy | Recall | F1 Score | AUC | Accuracy | Recall | F1 Score | AUC |
| w/o Context-Aware Decoding | 91.43±0.03 | 88.92±0.02 | 89.01±0.03 | 90.08±0.02 | 92.03±0.02 | 89.90±0.03 | 89.12±0.02 | 91.48±0.02 |
| w/o Context-Aware Scaling | 92.78±0.02 | 89.85±0.02 | 88.97±0.02 | 91.75±0.03 | 93.55±0.03 | 91.20±0.02 | 90.40±0.02 | 92.70±0.02 |
| w/o Expert-Prior Calibration | 90.15±0.03 | 87.40±0.02 | 87.95±0.02 | 89.77±0.02 | 91.78±0.02 | 88.63±0.03 | 89.30±0.02 | 90.95±0.03 |
| Ours | 95.02±0.02 | 92.67±0.02 | 91.90±0.02 | 94.10±0.02 | 94.76±0.02 | 92.81±0.02 | 91.42±0.02 | 94.85±0.02 |
From Fig. 7, we observe that removing the Context-Aware Decoding leads to a notable drop in performance, with Accuracy decreasing by 2.82 points and F1 Score by 2.01 points on the Primary School Student Posture Dataset. This indicates that the fusion module is essential for integrating visual and motion representations effectively. Similarly, the exclusion of Context-Aware Scaling results in consistent declines across all metrics, especially in F1 Score, suggesting that aligning the semantic spaces of different modalities enhances discriminative power. The impact of removing the Expert-Prior Calibration is also significant, with a performance decline of more than 3 points in several metrics, confirming its role in regularizing learning toward educationally relevant patterns. On the Secondary School Motion Recognition Dataset, we observe similar trends: the full model achieves 95.34% Accuracy and a 92.45 F1 Score, whereas each ablated variant suffers performance degradation, reinforcing the necessity of each component.
Fig. 7.
Comparison of different methods on primary and secondary school datasets, showing the effect of individual module removal and the improvement achieved by the proposed approach.
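The reported drops can be checked directly against the Table 4 means; for example, for the variant without Context-Aware Decoding on the primary-school posture dataset:

```python
# Drop of the "w/o Context-Aware Decoding" variant relative to the full
# model, computed from the Table 4 means on the primary-school posture
# dataset (mean values copied from the table, +/- intervals omitted).
full        = {"Accuracy": 94.98, "Recall": 91.66, "F1": 92.03, "AUC": 94.25}
wo_decoding = {"Accuracy": 92.16, "Recall": 88.74, "F1": 90.02, "AUC": 91.37}

drops = {m: round(full[m] - wo_decoding[m], 2) for m in full}
print(drops)  # {'Accuracy': 2.82, 'Recall': 2.92, 'F1': 2.01, 'AUC': 2.88}
```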
Figure 8 presents results that substantiate the effectiveness of each component within the KINEVAL framework. On the Physical Education Classroom Activity Dataset, the full model achieves 95.02% Accuracy and 91.90 F1 Score. Removing Context-Aware Decoding lowers Accuracy by 3.59 points, highlighting its role in modeling group dynamics. Excluding Context-Aware Scaling also degrades performance, showing the importance of synchronized modality learning. In the Intelligent Evaluation Model Training Dataset, omitting Expert-Prior Calibration drops AUC and F1 by over 3 points, emphasizing its role in distinguishing nuanced student behaviors. These ablations support KINEVAL’s design focus on adaptive attention fusion, aligned representation learning, and domain-informed supervision.
Fig. 8.
This figure shows the effects of removing different modules across two datasets. Solid and dashed lines represent results on the Physical Education and Intelligent Evaluation datasets, respectively.
To substantiate our claims regarding the generalizability and real-time applicability of the KINEVAL model, we conducted additional scenario-based evaluations across multiple types of physical education activities and varying classroom configurations. We tested the model’s performance on subsets involving different movement categories—such as track and field, ball games, and group fitness—and in both small-class (fewer than 15 students) and large-class (more than 30 students) settings. These scenarios reflect typical variations encountered in real-world school environments. As shown in Table 6, the KINEVAL model maintains high accuracy and F1 scores across all categories, with only minor fluctuations due to class density and motion complexity. The lowest accuracy (92.8%) occurs in large-class scenarios, where occlusion and motion overlap are more frequent. Nonetheless, the model still achieves near state-of-the-art results in all cases, demonstrating robust adaptability to diverse teaching contexts and group dynamics. In addition, we measured the average inference latency per frame on GPU-enabled servers and edge devices. All latency values remained under 35 milliseconds per frame, even in multi-person scenarios, thereby confirming that the model is well-suited for real-time classroom deployment with low-latency feedback capabilities. These results directly support the model’s adaptability and low-latency operation, as stated in the Introduction and Methodology sections. They also confirm that KINEVAL can generalize well across activity types and instructional settings while remaining computationally efficient for real-time use in primary and secondary school classrooms.
Table 6.
Scenario-based performance evaluation of the KINEVAL model.
| Scenario | Accuracy (%) | F1 Score | Latency (ms/frame) |
|---|---|---|---|
| Track and field | 94.5 | 0.912 | 28 |
| Ball games | 93.7 | 0.906 | 31 |
| Group fitness | 95.1 | 0.924 | 26 |
| Small class (<15 students) | 95.7 | 0.933 | 24 |
| Large class (>30 students) | 92.8 | 0.897 | 34 |
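The per-frame latency figures above can be reproduced with a simple timing harness of the following form; `infer` is a placeholder for any per-frame model callable, and the trivial lambda below merely illustrates usage, not the real KINEVAL forward pass.

```python
import time

def mean_latency_ms(infer, frames, warmup=10):
    """Average per-frame inference latency in milliseconds.

    `infer` is any per-frame callable; the first `warmup` calls are
    excluded so one-off initialisation cost does not skew the mean.
    """
    for f in frames[:warmup]:
        infer(f)
    t0 = time.perf_counter()
    for f in frames[warmup:]:
        infer(f)
    elapsed = time.perf_counter() - t0
    return 1000.0 * elapsed / max(1, len(frames) - warmup)

# Usage with a trivial stand-in for the real model:
latency = mean_latency_ms(lambda frame: frame * 2, list(range(110)))
```

Excluding warm-up iterations matters in practice because GPU kernel compilation and memory allocation typically inflate the first few calls.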
To further investigate the internal contributions of KINEVAL’s design, we conducted fine-grained ablation experiments by isolating specific components and hyperparameter settings. As presented in Table 7, the fine-grained ablation results on both the PE-Classroom Activity Dataset and the Intelligent Evaluation Model Training Dataset consistently validate the structural integrity and effectiveness of each KINEVAL component. Across all variants, the full model yields the highest performance, achieving 95.02% accuracy and 91.90 F1 score on the PE-Classroom dataset, and 94.76% accuracy and 91.42 F1 score on the Intelligent Evaluation dataset. Removing individual modules leads to significant and systematic performance drops across both benchmarks. For instance, excluding the GRU dilation mechanism reduces the F1 score by 2.29 points on the PE-Classroom dataset and by 2.52 points on the Intelligent Evaluation dataset. The temporal attention module shows a similar impact, with F1 score declines of 3.05 and 3.87 points, respectively, confirming its central role in capturing long-range motion dynamics. Delay-aware alignment contributes to instruction-following fidelity, and its removal leads to an average F1 reduction of roughly 3.2 points. Among all modules, the contextual gating mechanism shows the most severe performance drop, with F1 scores decreasing by 4.24 points on PE-Classroom and 4.48 points on the Intelligent Evaluation dataset, highlighting its importance in dynamically adapting the evaluation to environmental variations and instructional context. The hyperparameter experiments also reflect consistent trends across datasets. Reducing the attention window from 64 to 32 results in a performance drop of 1.49 F1 points (PE-Classroom) and 1.76 points (Intelligent Evaluation), while increasing it to 128 offers moderate gains with manageable computational overhead.
Likewise, lower embedding dimensionality (128) leads to underfitting and performance degradation (−2.60 F1 on PE-Classroom, −2.79 on Intelligent Evaluation), whereas a higher dimension (512) slightly improves accuracy but provides diminishing returns compared to the default 256 setting. These cross-dataset results collectively demonstrate that the architecture’s improvements are not dataset-specific but structurally robust. The similar degradation patterns across tasks validate the necessity of each component and support the generalizability of KINEVAL’s modular design in both motion-rich and feedback-oriented PE evaluation settings.
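The attention-window trade-off discussed above can be illustrated with a banded mask: halving the window shrinks the set of positions each timestep may attend to (reducing compute) at the cost of longer-range dependencies. This is a hedged sketch of the general local-attention pattern, not the paper's exact masking scheme.

```python
import numpy as np

def window_mask(seq_len, window):
    """Boolean mask letting each timestep attend only to positions
    within +/- window//2 of itself (a banded, local attention pattern)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

# A window of 4 over 8 timesteps: interior rows see 5 positions,
# edge rows fewer; enlarging the window widens the band and the cost.
m = window_mask(seq_len=8, window=4)
print(m.sum(axis=1).tolist())  # [3, 4, 5, 5, 5, 5, 4, 3]
```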
Table 7.
Fine-grained ablation comparison on two datasets. All results are averaged over three runs.
| Variant | PE-Classroom Activity |  | Intelligent Eval. Training |  |
|---|---|---|---|---|
|  | Accuracy (%) | F1 Score | Accuracy (%) | F1 Score |
| Full Model (KINEVAL) | 95.02 | 91.90 | 94.76 | 91.42 |
| w/o GRU Dilation | 93.14 | 89.61 | 92.73 | 88.90 |
| w/o Temporal Attention | 92.76 | 88.85 | 91.90 | 87.55 |
| w/o Delay-aware Alignment | 93.05 | 88.92 | 92.25 | 88.01 |
| w/o Contextual Gating | 91.93 | 87.66 | 91.48 | 86.94 |
| Attention Window = 32 | 93.70 | 90.41 | 93.12 | 89.66 |
| Attention Window = 128 | 94.38 | 91.03 | 94.33 | 90.87 |
| Embedding Dim = 128 | 93.02 | 89.30 | 92.85 | 88.63 |
| Embedding Dim = 512 | 94.87 | 91.75 | 94.52 | 91.18 |
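Since Table 7 identifies contextual gating as the most influential module, a minimal sketch of what such a mechanism can look like may be useful: a sigmoid gate computed from a context vector and applied element-wise. The matrix `W_g` and bias `b_g` are hypothetical stand-ins for learned parameters, not KINEVAL's actual weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contextual_gate(features, context, W_g, b_g):
    """Modulate a feature vector with a context-dependent sigmoid gate.

    The gate (values in (0, 1)) is computed from the context vector,
    e.g. instructional state or environment cues, and applied
    element-wise, suppressing channels irrelevant to the current context.
    """
    gate = sigmoid(W_g @ context + b_g)  # shape (d,)
    return gate * features

rng = np.random.default_rng(1)
d, c = 6, 4
out = contextual_gate(rng.normal(size=d), rng.normal(size=c),
                      rng.normal(size=(d, c)), np.zeros(d))
print(out.shape)  # (6,)
```

With an all-zero context and bias the gate is uniformly 0.5, i.e. the mechanism reduces to uniform scaling, which is why removing it forfeits context sensitivity.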
Conclusions and future work
In this study, we addressed critical limitations in existing physical education evaluation systems within primary and secondary school classrooms, which often fail to capture the temporal and contextual nuances of student motion and behavior. To resolve these issues, we designed an intelligent evaluation framework based on posture estimation and motion recognition, incorporating a dual-layer architecture. This system integrates a hierarchical attention-based model for spatial-temporal representation learning with a semantic contextualization strategy that aligns performance metrics with pedagogical objectives. Using the KINEVAL framework and Pedagogical Contextualization Strategy (PCS), we fused motion trajectory data, instructional context, and environmental cues to provide interpretable, structured evaluations. Experimental results demonstrated that our model significantly outperformed conventional baselines in terms of accuracy, fairness, and alignment with expert assessments, thus proving its effectiveness in real-world classroom scenarios.
Despite these advancements, two main limitations remain. First, the system’s dependence on high-quality, labeled motion data may hinder scalability across resource-constrained educational settings, limiting its broader applicability. Second, while the framework allows for contextual adaptation, its generalizability across diverse cultural and curricular environments needs further validation. Future work should focus on developing lightweight, semi-supervised data pipelines to reduce annotation overhead and on refining the contextual models to better accommodate cross-cultural pedagogical variations. These enhancements will further the potential of intelligent classroom evaluation systems in promoting equitable and adaptive physical education.
Acknowledgements
Sincere appreciation is extended to the colleagues, institutions, and organizations whose support was instrumental in enabling the completion of this research.
Author contributions
Conceptualization, YO; methodology, YS; software, YS; validation, YS; formal analysis, YS; investigation, YO; data curation, YO; writing—original draft preparation, YO, YS; writing—review and editing, YO; visualization, YO; supervision, YS; funding acquisition, YS; All authors have read and agreed to the published version of the manuscript.
Data availability
The datasets generated and/or analysed during the current study are available at https://sandbox.zenodo.org/records/307316?
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Liu, A. et al. Evaluation of a classroom-based physical activity promoting programme. Obes. Rev. 9, 130–134 (2008).
- 2. Morgan, P. & Hansen, V. Recommendations to improve primary school physical education: Classroom teachers’ perspective. J. Educ. Res. 101, 99–108 (2007).
- 3. Boyle-Holmes, T. et al. Promoting elementary physical education: results of a school-based evaluation study. Health Educ. Behav. 37, 377–389 (2010).
- 4. Story, M., Nanney, M. S. & Schwartz, M. B. Schools and obesity prevention: creating school environments and policies to promote healthy eating and physical activity. Milbank Q. 87, 71–100 (2009).
- 5. Bailey, R. Evaluating the relationship between physical education, sport and social inclusion. Educ. Rev. 57, 71–90 (2005).
- 6. Mercier, K. & Doolittle, S. Assessing student achievement in physical education for teacher evaluation. J. Phys. Educ. Recreat. Dance 84, 38–42 (2013).
- 7. Phillips, S. R., Mercier, K. & Doolittle, S. Experiences of teacher evaluation systems on high school physical education programs. Phys. Educ. Sport Pedagogy 22, 364–377 (2017).
- 8. Hills, A. P., Dengel, D. R. & Lubans, D. R. Supporting public health priorities: recommendations for physical education and physical activity promotion in schools. Prog. Cardiovasc. Dis. 57, 368–374 (2015).
- 9. Watson, A., Timperio, A., Brown, H., Best, K. & Hesketh, K. D. Effect of classroom-based physical activity interventions on academic and physical activity outcomes: a systematic review and meta-analysis. Int. J. Behav. Nutr. Phys. Act. 14, 114 (2017).
- 10. Jackson, B., Whipp, P. R., Chua, K. P., Pengelley, R. & Beauchamp, M. R. Assessment of tripartite efficacy beliefs within school-based physical education: Instrument development and reliability and validity evidence. Psychol. Sport Exerc. 13, 108–117 (2012).
- 11. Xing, Z. et al. Towards visual interaction: hand segmentation by combining 3D graph deep learning and laser point cloud for intelligent rehabilitation. IEEE Internet Things J. (2025).
- 12. Xing, Z. et al. Intelligent rehabilitation in an aging population: empowering human-machine interaction for hand function rehabilitation through 3D deep learning and point cloud. Front. Comput. Neurosci. 19, 1543643 (2025).
- 13. Hasyim, A. H., Haris, I. N. & Yulianto, A. G. Analysis of evaluation models in physical education learning: Systematic literature review. Indonesian J. Sport Manag. 4, 98–105 (2024).
- 14. Centeio, E. et al. The success and struggles of physical education teachers while teaching online during the COVID-19 pandemic. J. Teach. Phys. Educ. 40, 667–673 (2021).
- 15. Wahidah, I., Listyasari, E., Rahmat, A. A. & Rohyana, A. Evaluation of physical education independent curriculum through CIPP: managerial implementation in learning activities. Indonesian J. Sport Manag. 3, 208–223 (2023).
- 16. Khairuddin, K., Masrun, M., Bakhtiar, S. & Syahruddin, S. An analysis of the learning implementation of physical education in junior high schools. J. Cakrawala Pendidikan 42, 241–253 (2023).
- 17. Gustian, U., Saputra, D. R., Rakhmat, C., Yustiana, Y. R. & Primayanti, I. Physical education and its scope: A literature review of empirical studies with a holistic perspective teaching practices in Indonesia. Indonesian J. Phys. Educ. Sport Sci. 4, 171–186 (2024).
- 18. Ding, X., Peng, W. & Yi, X. Evaluation of physical education teaching effect based on action skill recognition. Comput. Intell. Neurosci. 2022, 9489704 (2022).
- 19. Gao, X. et al. From motion signals to insights: A unified framework for student behavior analysis and feedback in physical education classes. arXiv preprint arXiv:2503.06525 (2025).
- 20. Goodyear, V. A., Skinner, B., McKeever, J. & Griffiths, M. The influence of online physical activity interventions on children and young people’s engagement with physical activity: a systematic review. Phys. Educ. Sport Pedagogy 28, 94–108 (2023).
- 21. Phillips, S. M. et al. A systematic review of the validity, reliability, and feasibility of measurement tools used to assess the physical activity and sedentary behaviour of pre-school aged children. Int. J. Behav. Nutr. Phys. Act. 18, 141 (2021).
- 22. Wilhite, K. et al. Combinations of physical activity, sedentary behavior, and sleep duration and their associations with physical, psychological, and educational outcomes in children and adolescents: a systematic review. Am. J. Epidemiol. 192, 665–679 (2023).
- 23. Liu, Y. et al. Exploring intelligent feedback systems for student engagement in digitized physical education learning through machine learning. In Proceedings of the 2024 International Conference on Artificial Intelligence and Future Education, 114–118 (2024).
- 24. Zhu, W. Evaluation of physical education system and training framework performance based on automatic information retrieval system. Int. J. Reliab. Qual. Saf. Eng. (2023).
- 25. Wu, J.-H. Design and development of artificial intelligence dynamic physical education teaching resources in human-computer interaction mode. J. Educ. Comput. Res. 07356331251337991 (2025).
- 26. Gasteiger, N. Virtual reality and augmented reality for upskilling care home workers in hand hygiene practice: A realist evaluation. Ph.D. thesis, The University of Manchester (2023).
- 27. Karvelas, I. The Role of Compilation Mechanisms in Novice Programming Behaviour. Ph.D. thesis, University College Dublin, School of Computer Science (2022).
- 28. Cao, F., Xiang, M., Chen, K. & Lei, M. Intelligent physical education teaching tracking system based on multimedia data analysis and artificial intelligence. Mob. Inf. Syst. 2022, 7666615 (2022).
- 29. Lu, Y. & Long, H. Motion capture algorithm for students’ physical activity recognition in physical education curriculum. J. Artif. Intell. Technol. 5, 1–9 (2025).
- 30. Hu, C. Evaluation of physical education classes in colleges and universities using machine learning. Soft Comput. 26, 10765–10773 (2022).
- 31. Li, S., Wang, C. & Wang, Y. Fuzzy evaluation model for physical education teaching methods in colleges and universities using artificial intelligence. Sci. Rep. 14, 4788 (2024).
- 32. Peng, Y., Lee, J. & Watanabe, S. I3D: Transformer architectures with input-dependent dynamic depth for speech recognition. In ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
- 33. Kim, G.-I., Yoo, H. & Chung, K. SlowFast based real-time human motion recognition with action localization. Comput. Syst. Sci. Eng. 47 (2023).
- 34. Zanbouri, K. et al. A comprehensive survey of wireless time-sensitive networking (TSN): Architecture, technologies, applications, and open issues. IEEE Commun. Surv. Tutorials (2024).
- 35. Shen, Y., Zhang, J. & Li, Y. Behavior recognition of teachers and students in the smart classroom based on deep learning. In 2023 4th International Conference on Information Science and Education (ICISE-IE), 345–349 (IEEE, 2023).
- 36. Terao, H., Noguchi, W., Iizuka, H. & Yamamoto, M. Multi-stream single network: efficient compressed video action recognition with a single multi-input multi-output network. IEEE Access 12, 20983–20997 (2024).
- 37. Nguyen, T. G. T. et al. Video classification based on the behaviors of children in pre-school through surveillance cameras. In International Conference on Intelligence of Things, 45–54 (Springer, 2023).