Scientific Reports. 2026 Jan 30;16:6670. doi: 10.1038/s41598-026-36668-y

Semantic-aware self-supervised learning using progressive sub-action regression for action quality assessment

Marjan Mazruei 1, Ehsan Fazl-Ersi 1, Abedin Vahedian 1, Ahad Harati 1
PMCID: PMC12913629  PMID: 41611864

Abstract

Action Quality Assessment (AQA) is a growing field in computer vision that focuses on objectively evaluating human actions from videos, with applications across various domains. Current approaches typically provide only a single overall score, which lacks the granular details necessary for actionable performance feedback. This limitation is compounded by the scarcity of fine-grained annotations; while a few publicly available datasets contain sub-action temporal boundaries, none provide explicit sub-score labels. This paper introduces a novel framework that addresses these challenges by decomposing actions into interpretable sub-actions and leveraging self-supervised learning to enhance feature representations. An unsupervised temporal segmentation module first partitions a video into semantically meaningful sub-actions. Subsequently, a self-supervised learning mechanism refines the initial spatio-temporal features, making them more robust to temporal irregularities and more discriminative for subtle motion nuances. These robust features are then used in a progressive pseudo-subscore learning mechanism that explicitly models the sequential dependencies between sub-actions, generating fine-grained feedback that differentiates between short-range causal effects and cumulative long-range influences. The efficacy of the proposed framework is validated through comprehensive experiments on the UNLV-Diving and FineDiving datasets. The results demonstrate state-of-the-art performance on the Spearman’s Rank Correlation (SRC) metric, confirming that robust feature representations and explicit temporal modeling are crucial for accurate assessment.

Keywords: Action quality assessment, Fine-grained analysis, Unsupervised temporal semantic segmentation, Self-supervised learning, Pseudo-subscore generation, Multi-substage regression

Subject terms: Computational biology and bioinformatics, Mathematics and computing

Introduction

With advancements in deep learning and video analysis, the automated modeling of human behavior has become a central focus in computer vision. Within this domain, AQA is a specific and challenging task that involves regressing a continuous quality score for a given action sequence. Unlike standard action recognition, which provides a categorical label, AQA offers a nuanced measure of proficiency. This task holds significant implications across a wide range of applications, including sports, healthcare, industrial training, and even the evaluation of AI-generated content. Such systems are instrumental in enhancing professional skills, streamlining training, and reducing the subjectivity and resource costs associated with human evaluation.

Most existing AQA methodologies focus on predicting a single overall score for an entire action sequence. This approach, however, often overlooks the intricate details of sub-action-level features and their contextual relationships, consequently failing to provide an explanation for the final assessment. The absence of fine-grained feedback, such as identifying a specific weakness during a dive’s entry phase, severely limits the utility of these systems. Furthermore, a significant hurdle for fine-grained AQA is the scarcity of annotated resources. The majority of publicly available datasets lack temporal segmentation annotations and explicit sub-action scores, which are prohibitively expensive and time-consuming to acquire. As a result, many models struggle to provide sub-action-level feedback.

This study proposes a novel framework designed to overcome these challenges by decomposing the AQA task into a sequence of interpretable sub-actions. The proposed approach utilizes an unsupervised temporal segmentation module to automatically partition videos into semantically meaningful stages, thereby circumventing the need for expensive manual annotations. A key innovation of this work is the integration of a self-supervised learning module that enhances the robustness and discriminative power of the extracted spatio-temporal features. This module compels the network to learn robust feature representations by simulating temporal occlusions through a frame-masking strategy, which is crucial for capturing the subtle execution nuances that define quality. These enhanced features are then used in a progressive pseudo-subscore learning mechanism that models sequential dependencies and treats sub-scores as intermediate latent variables, generating fine-grained, interpretable feedback. The pseudo-subscores are generated by augmenting the robust spatio-temporal features with the overall score label; together with contextual dependencies, they enrich the features of each sub-action with evaluative information drawn from the overall score and from preceding sub-actions. This iterative process refines the representation of each sub-action, enabling robust pseudo-subscore generation and superior overall score prediction. The model is trained on the training videos using the generated pseudo-subscores. Finally, a post-training step removes the overall score label, allowing the final model to predict both sub-scores and the overall score directly from video input, ensuring accurate assessments.

The main contributions of this work are summarized as follows:

  • A novel framework is proposed that integrates unsupervised temporal semantic segmentation with a self-supervised learning strategy to perform robust AQA and provide detailed feedback for identifying strengths and weaknesses in an athlete's performance in sports competitions, without requiring fine-grained annotations.

  • A self-supervised refinement module is introduced to enhance spatio-temporal feature representations, improving their robustness to temporal irregularities and their ability to capture subtle motion nuances critical for quality assessment.

  • A novel progressive pseudo-subscore learning mechanism is devised to explicitly model the temporal causality of sub-actions. This is achieved through two distinct strategies to capture:

  1. Short-term dependencies, where the features of each sub-action are augmented with the pseudo-subscore of its immediately preceding sub-action.

  2. Long-term dependencies, where the cumulative impact of all preceding sub-actions is taken into consideration.

  • A comprehensive evaluation is conducted on the UNLV-Diving1 and FineDiving2 datasets, where the proposed framework achieves state-of-the-art performance on the SRC metric. The effectiveness of each architectural component is further validated through extensive ablation experiments, demonstrating the superiority of the proposed framework.
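Since SRC is the headline metric above, a brief self-contained illustration may help: it is the Pearson correlation computed on the ranks of the scores rather than the scores themselves. The judge and model scores below are purely hypothetical, and a hand-rolled rank correlation (valid only without ties) is used to keep the sketch dependency-light.

```python
import numpy as np

def spearman_rc(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no tied scores, so ranks reduce to argsort of argsort."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# Hypothetical judge scores and model predictions for six dives;
# the model swaps the ranks of exactly one pair of performances.
true_scores = np.array([85.5, 92.0, 70.3, 88.1, 64.7, 79.9])
pred_scores = np.array([83.0, 90.5, 72.1, 86.0, 66.2, 84.0])
print(round(spearman_rc(true_scores, pred_scores), 3))  # → 0.943
```

An SRC of 1.0 would mean the model orders all performances exactly as the judges do; one swapped adjacent pair among six items yields 1 − 6·Σd²/(n(n²−1)) = 1 − 12/210 ≈ 0.943.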

The remainder of this paper is structured as follows. The related work is first reviewed. The proposed framework is then presented, with its components described in detail. This is followed by a presentation of the experimental results and analysis. Finally, the paper concludes with a summary of findings and a discussion of future research directions.

Related work

AQA is a rapidly evolving field that aims to automate the nuanced evaluation of human actions in video data. Deep learning-based AQA frameworks commonly comprise two main components: a feature extraction module and an evaluation module (Fig. 1). This section reviews the literature with a particular emphasis on the evolution of feature extraction techniques and the strategies employed to model and predict action quality. The discussion first provides a brief summary of advancements in feature representation learning that underpin effective AQA systems, and then reviews existing AQA methodologies.

Fig. 1. Framework of the deep learning-based AQA model.

Feature extraction using representation learning for AQA

Effective AQA is fundamentally dependent on the ability to extract meaningful and discriminative features from video sequences. This has been a central challenge in the field of video representation learning. Early AQA systems primarily relied on hand-crafted features, which involved the meticulous extraction of spatio-temporal keypoints from video frames. These methods often used classical signal processing techniques such as Discrete Fourier Transform (DFT) or Discrete Cosine Transform (DCT)3 to refine features or aggregated features from dense trajectories using a Bag-of-Words (BoW) or Fisher vector encoding4,5. A significant limitation of these methods was their inability to fully capture discriminative features, making it difficult to understand the subtle intra-class differences in actions that are crucial for nuanced AQA6. Pose-based features were also explored to offer interpretable feedback by providing guidance on limb positioning, but their reliability was often compromised by inaccuracies in pose estimation and their inability to model contextual visual cues, such as the interaction with objects3.

The landscape of video representation was transformed with the rise of deep neural networks, which demonstrated a superior ability to automatically learn hierarchical features. Initial deep learning approaches repurposed 2D Convolutional Neural Networks (CNNs) from image classification for video tasks by processing individual frames and aggregating the features7. However, this approach largely neglects temporal relationships, which are essential for AQA. To address this, researchers developed two-stream CNNs, which separately process spatial information from RGB frames and temporal information from pre-computed optical flow fields8. While effective, this approach is computationally expensive due to the optical flow pre-computation. A more direct solution came with the introduction of 3D Convolutional Networks (3D ConvNets), which extend 2D convolutions to encompass the temporal dimension, enabling the simultaneous modeling of appearance and motion. Pioneering works like C3D9 and I3D10 demonstrated the power of this approach. To mitigate the high computational cost of full 3D convolutions, architectures like the Pseudo-3D (P3D) ConvNet factorized 3D kernels into separate 2D spatial and 1D temporal convolutions, resulting in significant parameter reduction and improved efficiency11.

Despite these advances, 3D CNNs have high memory and computational costs, making them suitable for processing only short video clips. Most AQA approaches operate on clip-level feature representations rather than video-level representations. This involves dividing long videos into smaller video clips of fixed temporal length, extracting features for each clip, and then integrating these clip-level features to obtain a video-level representation. This strategy, however, can suffer from intra-clip confusion. Some works have segmented videos into sub-actions, which improves feature efficacy. Nonetheless, these approaches require temporal boundary annotations for the training set to learn an effective semantic segmentation model.

Beyond the specific domain of AQA, the principles of semantic consistency and temporal relation modeling employed in the proposed framework align theoretically with broader challenges in weakly supervised video understanding, particularly where frame-level annotations are unavailable. Recent advancements in weakly supervised video anomaly detection have leveraged semantic-driven consistency learning12 and hierarchical position–scale awareness13 to enforce feature robustness against noise—a methodology conceptually analogous to the masking-based self-supervision strategy utilized in this study. Furthermore, the objective of distilling the intrinsic semantics of sub-actions parallels established paradigms in few-shot and zero-shot action recognition, as exemplified by architectures such as the Video Disentangling Attentive Relation Network (VDARN)14.

Action quality assessment methods

Existing AQA methodologies can be broadly categorized into three major groups based on their core objectives. These approaches, which will be discussed in detail, provide a structured perspective for positioning the proposed framework.

Regression-based

The most prevalent approach to AQA is treating it as a regression problem to predict a continuous score. Early studies established supervised regression frameworks, which subsequently evolved in diverse research directions encompassing granularity, modality fusion, generalization, continual learning, explainability, and self-supervision. A seminal contribution was made by Pirsiavash et al.3, who formulated AQA as a supervised regression task and introduced a learning-based framework that extracted spatio-temporal pose features, refined them using the DCT, and employed linear support vector regression (L-SVR) for score prediction. Venkataraman et al.15 showed that the approximate entropy of pose features better encoded motion information than DCT/DFT. A complementary line of work proposed a real-time solution using a two-layer hierarchical framework that jointly modeled posture and motion features16. Parmar and Morris1 further advanced the field by employing C3D features9 in combination with SVR and LSTM-based models17, while Li et al.18 utilized C3D9 with ranking and MSE losses. In addition, HalluciNet19 proposed a lightweight 2D-CNN for predicting 3D spatio-temporal features from a single image, reducing computational cost.

The modeling of sub-action granularity constitutes a foundational paradigm in broader video understanding tasks, establishing the theoretical basis for fine-grained analysis. Wang et al.20 validated the necessity of such granularity for effective weakly supervised temporal action localization, facilitating the discernment of precise temporal boundaries. Concurrently, Liu et al.21 proposed a hierarchical video description framework, positing that complex activities are best understood by modeling the hierarchical structure of their constituent events. Building upon these insights from action localization and recognition, segment-aware approaches have become increasingly important for providing fine-grained feedback in AQA. These methods decompose actions into semantically meaningful sub-actions, enabling stage-wise evaluation. For instance, the stacked 3D regressor (S3D)22 employed temporal convolutional networks (ED-TCN)23 to segment videos into sub-actions, whose features were subsequently fused for final score prediction. Following this, ScoringNet24 introduced a key-fragment segmentation method using 3D CNNs and Bi-LSTMs, coupled with a dual-loss strategy to emphasize discriminative segments, and MSRM25 utilized temporal semantic segmentation and multi-substage regression to predict substage scores.

The work most closely related to our own is the Label-Reconstruction-based Pseudo-Subscore Learning (PSL) method26, which also generates pseudo-subscores to overcome the lack of fine-grained annotations. While PSL marked a significant step forward, our framework introduces several critical advancements that substantially improve upon its limitations. First, a novel self-supervised learning module refines initial spatio-temporal features via a temporal masking mechanism, a capability not present in PSL. This is a fundamental improvement that addresses the problem of feature representation robustness. Second, a more sophisticated progressive pseudo-subscore learning mechanism explicitly models the sequential dependencies between sub-actions, distinguishing between short-range and long-range causal impacts. This is a crucial distinction from PSL, which treats sub-scores as independently generated intermediate variables. Furthermore, a key architectural difference lies in the method of temporal segmentation. While PSL relies on a supervised approach (ED-TCN) that necessitates explicit temporal annotations, the proposed framework utilizes an unsupervised segmentation method. This removes the reliance on time-consuming boundary labels, making the model applicable to a wider range of datasets lacking such fine-grained annotations.

The emergence of Transformer-based architectures27 has further improved the modeling of long-range temporal dependencies. Xu et al.28 introduced a self-attentive LSTM to capture critical technical movements, complemented by a multi-scale convolutional skip LSTM for local and global dynamics. Lei et al.29 developed a temporal attention framework that adaptively emphasized score-relevant video segments, while the grade-decoupling Likert Transformer leveraged cross-attention to extract grade-aware representations30. Iyer et al.31 incorporated self-attention to capture extended temporal dependencies, whereas Zhang et al.32 proposed a time-aware attention mechanism to model inter-segment relationships. Ji et al.33 presented LUSD-Net, a localization-assisted uncertainty score disentanglement network that separated score-relevant features while accounting for spatial uncertainty. Lian and Shao34 advanced across-stage temporal reasoning to capture inter-stage dependencies, particularly under data imbalance conditions. Several domain-specific datasets and architectures have further enriched the field. Liu et al.35 introduced a figure-skating dataset tailored for replay-guided modeling of action sequences.

Other recent works have explored sophisticated temporal modeling through attention and graph-based approaches. Zhang et al.6 addressed intra-clip confusion and inter-clip incoherence with a hierarchical GCN that modeled multi-level spatial and temporal dependencies across joints. Extensions to this idea incorporated group-aware attention into GCNs to evaluate group dynamics36. Huang and Li37 developed a semantic-sequence performance regression model with densely distributed sample weighting to capture fine-grained temporal cues, while Ke et al.38 introduced a two-path target-aware contrastive regression method to simultaneously model global and local action dynamics.

Actor–object-centric approaches mitigate the influence of irrelevant background information by concentrating on the primary subject and contextually relevant objects. Early work in this direction, JR-GCN39, proposed a joint-relation graph to model dependencies among body joints. Zeng et al.40 designed a hybrid attention network that integrated dynamic temporal cues with static contextual information, while Chen et al.41 introduced SportsCap, a monocular 3D motion capture framework that combined pose estimation with fine-grained motion analysis. Additional strategies explicitly trained models to suppress background influence. For example, C3D-AVG-SA&HMreg applied adversarial loss to reduce reliance on background features42, and TSA-Net43 employed target-tracking to consistently focus on the performing athlete. More recently, Huang et al.44 presented a dual-referenced assistive network incorporating semantic-level grade prototypes and rating-guided attention to reconstruct quality-oriented features.

To tackle the inherent biases and ambiguity in human-provided scores, uncertainty-aware approaches have been developed to predict score distributions rather than single values. Models like the Uncertainty-Aware Score Distribution Learning (USDL)45 and the Distribution Auto-Encoder (DAE)46 are trained to minimize the divergence between a predicted score distribution and the ground-truth distribution derived from multiple judges. Other works have employed Conditional Variational Auto-Encoders to model perceptual uncertainty47 and integrated probabilistic priors into temporal modeling with a Gaussian-guided encoder48. Majeedi et al.49 proposed a rubric-informed calibrated method that integrated predefined evaluation criteria to guide feature learning for structured action scoring.

Parmar and Morris50 pioneered multitask learning by jointly training a single framework for action recognition, commentary generation, and score prediction. This concept has been extended to multimodal fusion, where different data streams, such as RGB video, skeleton data, and audio, provide complementary information. These methods leverage the robust nature of skeletal information51–55, audio signals56,57, or a dynamic fusion of multiple modalities to improve accuracy58–60. Du et al.61 proposed a semantics-guided representation learning approach to incorporate high-level action semantics, while Gedamu et al.62 aligned visual features with semantic structures through a visual-semantic alignment parsing method.

Generalization across diverse actions and domains has also been actively pursued. Knowledge transfer techniques63 have been applied to extend models across multiple actions. Continual learning in AQA tackles the need for models to accumulate knowledge over time without catastrophic forgetting. Methods such as parameter-efficient continual pre-training (PECoP)64 and Continual-AQA65 use techniques like feature-score correlation-aware rehearsal to retain a memory of previous tasks while learning new ones. Similarly, manifold-aligned graph regularization (MAGR)66 enforces consistency between new and old feature distributions to maintain performance stability.

Explainability has emerged as a critical consideration. Rubric-informed frameworks67 and hierarchical neurosymbolic approaches68 combined neural learning with symbolic reasoning to enhance transparency. Dong et al.69 addressed long-term action scoring using attention-based loss and query initialization to alleviate temporal skipping in transformer models. Self-supervised learning has also emerged as a powerful paradigm to reduce reliance on labeled data. S4AQA70 employed masked segment feature recovery to exploit temporal dependencies in unlabeled data, combining supervised regression with adversarial alignment between labeled and unlabeled distributions. More recently, some approaches have focused on causal reasoning71, self-supervised sub-action parsing72, and semi-supervised paradigms73 to improve fine-grained modeling.

Pairwise-comparison-based

In contrast to regression, pairwise-comparison approaches formulate AQA as a ranking task, focusing on comparing the quality of videos rather than assigning absolute scores. Such methods are particularly valuable in scenarios where distinguishing relative performance differences is more reliable than relying on subjective human-provided scores. Early works utilized Siamese two-stream CNNs trained with a ranking loss to compare pairs of videos74,75. Subsequent work introduced rank-aware temporal attention models, in which learnable attention mechanisms emphasized skill-relevant segments, and ranking losses were incorporated to enhance discriminative capability76. This paradigm was further extended with transformer-based architectures, such as the framework proposed by Bai et al.77, which extracted fine-grained temporal representations using transformer decoders and optimized performance with ranking and sparsity loss functions. Recent works have introduced rhythm-aware transformers that modeled temporal patterns and cadence to assess skill through ranking-based evaluation78, and Gedamu et al.79 developed a fine-grained spatio-temporal parsing network that encoded detailed action dynamics across frames. Xu et al.2 introduced a procedure-aware approach with a temporal segmentation attention module, leveraging cross-attention and fine-grained contrastive regression. FineParser80 advanced spatio-temporal parsing for detailed motion analysis, while Xu et al.81 proposed procedure-aware datasets and protocols to support stage-wise evaluation. Hipiny et al.82 introduced a ranked TikTok dance dataset and designed a pairwise framework to assess motion quality through relative comparisons, thereby facilitating fine-grained performance evaluation.

The concept has since been refined through frameworks like group-aware contrastive regression83,84. An et al.85 proposed a multi-stage contrastive regression method that progressively refined score predictions by contrasting intermediate action representations, while Fang et al.86 extended pairwise ranking to educational contexts by assessing teacher performance through classroom video analysis. Other advancements include adaptive frameworks that dynamically adjust feature modeling to accommodate varying action characteristics87 and skill transfer approaches that leverage learned skills from source actions for new tasks88. Although pairwise-comparison methods excel in capturing relative differences, they typically require large-scale annotated datasets to construct sufficient and meaningful video pairs, which remains a significant limitation.

Classification-based

Classification-based frameworks have been extensively applied in domains such as surgical skill assessment, where performance can be categorized into discrete levels (e.g., novice, intermediate, expert). Early research used pose-based features with sparse Hidden Markov Model (HMM)89 for surgical gesture classification. Subsequent work advanced these methods by incorporating robot kinematic data with entropy-driven features and weighted fusion strategies, outperforming traditional HMM-based approaches90. Other efforts combined video and accelerometer data to jointly model kinematic and visual cues91. Graph-based methods have also been explored to capture motion patterns in rehabilitation contexts. Li et al.92 proposed a graph convolutional Siamese network for recognizing and assessing rehabilitation exercises, while Zheng et al.93 designed a skeleton-based framework that employed rotation-invariant features to ensure consistency across viewpoints. Bruce et al.94 introduced EGCN++, an ensemble learning strategy that fused multiple skeleton-based models for improved accuracy in rehabilitation assessment. A coarse-to-fine instruction alignment approach has been adopted to incorporate hierarchical guidance for detailed action evaluation95. Although these approaches offer generalizability, they rely on rigid performance categories that may overlook subtle differences in operator actions, limiting their applicability to complex real-world scenarios where continuous scoring is required.

The proposed framework builds upon the strengths of fine-grained AQA methods by addressing two of their most significant limitations: the lack of robust feature representations and the simplistic modeling of temporal dependencies. It not only generates fine-grained feedback from coarsely annotated videos but also models the sequential dependencies between sub-actions, distinguishing itself from previous works.

Proposed approach

This study introduces a novel framework for vision-based human action quality assessment that addresses the limitations of sparse annotations and the inherent challenges in modeling fine-grained temporal dynamics. The framework is designed to automatically generate performance scores from sports videos by formulating the scoring task as a multi-stage regression problem. At its core, the framework integrates unsupervised sub-action decomposition with a self-supervised learning strategy to produce robust feature representations. A progressive pseudo-subscore mechanism is then employed to model sequential dependencies and refine the final prediction.

A key innovation lies in the use of a self-supervised learning module to enhance initial spatio-temporal features. This module is trained to capture the intrinsic characteristics of sub-action clips by learning a robust representation from temporally masked video segments. This process compels the network to encode not only local motion patterns but also long-range dependencies, thereby yielding representations that are more robust and semantically meaningful for the downstream AQA task. The overall architecture of the framework is illustrated in Fig. 2. As shown, the proposed framework comprises five key components: (1) an unsupervised sub-action segmentation model, (2) a robust spatio-temporal feature extraction model, (3) a pseudo-subscore calculation model, (4) a sequential feature augmentation with pseudo-subscores model, and (5) a multi-substage AQA regression model.
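To make stage (4) of the pipeline concrete, the toy NumPy sketch below illustrates one plausible reading of the sequential feature augmentation: each sub-action's feature vector is extended with the pseudo-subscore of its immediate predecessor (short-range dependency) and the accumulated pseudo-subscores of all predecessors (long-range dependency). The array sizes and the concatenation scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 4, 8                       # number of sub-actions, feature dimension (toy sizes)
feats = rng.normal(size=(M, K))   # stand-in for robust sub-action features f_m
pseudo = rng.uniform(0, 10, M)    # stand-in for pseudo-subscores p_m from stage (3)

augmented = []
for m in range(M):
    prev = pseudo[m - 1] if m > 0 else 0.0   # short-range: score of the preceding sub-action
    cum = pseudo[:m].sum() if m > 0 else 0.0  # long-range: cumulative influence of all predecessors
    augmented.append(np.concatenate([feats[m], [prev, cum]]))
augmented = np.stack(augmented)   # shape (M, K + 2): features enriched with evaluative context
print(augmented.shape)            # → (4, 10)
```

The two appended channels give the downstream regressor explicit access to both the immediately preceding execution quality and the accumulated quality so far, mirroring the short- versus long-range distinction drawn in the text.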

Fig. 2. Overview of the proposed framework. The framework consists of the following key stages: (1) Unsupervised Sub-action Segmentation Module: Input videos are segmented into meaningful video clips, each representing an individual sub-action. (2) Robust Spatio-Temporal Feature Extraction Module: This module first extracts initial spatio-temporal features from each sub-action using a pre-trained 3D convolutional network. A robust, semantic-aware representation for each sub-action is then derived via a self-supervised learning approach. (3) Pseudo-subscore Calculation Module: Pseudo-subscores are generated for each sub-action by integrating overall score labels with extracted features. (4) Sequential Feature Augmentation with Pseudo-subscores Module: Sequential dependencies are modeled by iteratively enhancing sub-action representations with contextual information from the overall score label and preceding pseudo-subscores. (5) Multi-substage AQA Module: Refined pseudo-subscores are aggregated to produce an overall score to ensure accurate predictions.

Problem formulation

The objective of vision-based AQA is to automatically and objectively evaluate specific human actions. Given a video dataset $\mathcal{D} = \{(V_i, s_i)\}_{i=1}^{N}$, where $N$ denotes the total number of videos, each $V_i$ consists of $L$ frames, and $s_i$ is the corresponding overall quality score, the goal is to learn a mapping function $\mathcal{F}$ that predicts the overall score $\hat{s}$ of an athlete for any given video $V$. The proposed framework can be formally expressed as:

$$\hat{s} = \mathcal{F}\left(V;\, \theta_f, \theta_s, \theta_r\right) \qquad (1)$$

where $\theta_f$, $\theta_s$, and $\theta_r$ denote the parameters of the initial spatio-temporal feature extraction, self-supervised representation learning, and multi-substage score regression modules, respectively.

Unsupervised sub-action segmentation

Assigning a single score to an entire video sequence is a common practice in AQA, but it overlooks the temporal dynamics and limits the granularity of performance feedback. To enable a finer-grained analysis, each input video $V_i$ is segmented into a set of semantically meaningful video clips, each representing a distinct sub-action. This process is formally defined as:

$$\{C_1, C_2, \ldots, C_M\} = \mathcal{G}(V) \qquad (2)$$

where $\mathcal{G}$ is an unsupervised temporal semantic segmentation function that partitions $V \in \mathbb{R}^{L \times H \times W \times 3}$, which denotes a sequence of $L$ consecutive RGB frames with spatial resolution $H \times W$, into a set of video clips $\{C_m\}_{m=1}^{M}$. Each video clip $C_m$ corresponds to a sub-action segment of variable duration $l_m$. Due to the inherent variability in human motion and action composition, the segment lengths $l_m$ generally differ across the $M$ clips.

Given that temporal sub-action annotations are rare in publicly available datasets, an unsupervised segmentation approach is essential. Unlike prior works that rely on supervised fine-grained labels for sub-action segmentation22,25,26, this framework therefore adopts the Temporally-Weighted FINCH (TW-FINCH) algorithm96, chosen for its efficacy in identifying meaningful temporal boundaries without requiring explicit labels. This unsupervised decomposition is a crucial preliminary step, as it facilitates a more detailed, sub-action-level assessment of the video.
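The sketch below is a deliberately simplified stand-in for this segmentation step, not TW-FINCH itself: the actual algorithm builds temporally weighted first-neighbour relations, whereas here a small k-means clusters per-frame features jointly with a weighted temporal index (the same intuition behind temporal weighting) and then reads sub-action boundaries off the label sequence. All names, sizes, and the synthetic input are illustrative.

```python
import numpy as np

def segment_frames(frame_feats, n_segments, time_weight=1.0, iters=20):
    """Toy stand-in for unsupervised temporal segmentation: cluster frames
    on appearance plus a temporally weighted index so clusters stay
    contiguous in time, then derive boundaries from label changes."""
    L = len(frame_feats)
    t = (np.arange(L) / L).reshape(-1, 1) * time_weight  # weighted temporal index
    X = np.hstack([frame_feats, t])
    # k-means with evenly spaced temporal initialisation (deterministic)
    centers = X[np.linspace(0, L - 1, n_segments).astype(int)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    # sub-action boundaries: frames where the label sequence changes
    return [0] + [i for i in range(1, L) if labels[i] != labels[i - 1]]

# synthetic "video": 12 frames with three distinct sub-action appearances
feats = np.repeat(np.eye(3), [4, 5, 3], axis=0)
print(segment_frames(feats, n_segments=3))  # → [0, 4, 9]
```

The temporal channel discourages assigning temporally distant but visually similar frames to the same cluster, which is the property that makes the resulting clusters usable as contiguous sub-action segments.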

Robust spatio-temporal feature extraction

The extraction of robust spatio-temporal features from video data is critical for fine-grained AQA, where subtle execution nuances must be captured for accurate evaluation. While existing 3D convolutional networks (e.g., C3D9, P3D11, I3D10) are capable feature extractors, their representations often lack robustness to temporal discontinuities or occlusions. This section details a self-supervised learning strategy designed to enhance initial features and create a more robust representation.

Initial spatio-temporal feature extraction

After unsupervised temporal semantic segmentation, the resulting video clips Inline graphic are processed to obtain initial spatio-temporal feature representations. The feature extraction process is formulated as:

graphic file with name d33e857.gif 3

where Inline graphic is a deep neural network, and K is the dimensionality of the feature vector. The framework is designed to be highly adaptable, allowing the integration of various architectures, including 3D Convolutional Networks (C3D)9, Inflated 3D Networks (I3D)10, and other advanced video feature representation networks. For its efficiency and strong performance, the Pseudo-3D (P3D) network11 is employed as the backbone.

Pre-trained P3D models typically require a fixed temporal window to produce a single feature representation per video clip. This fixed-length constraint is driven by two considerations: locality and memory. Firstly, unlike 2D CNNs, which capture spatial locality, 3D CNNs are specifically designed to capture temporal locality within a video segment. Secondly, GPU memory is a practical constraint, as the memory footprint grows with the number of input frames, making long-range 3D CNNs memory-intensive. Because every frame can strongly influence the final score, a fixed number of frames must be sampled in a manner that minimizes potential information loss. The output is an initial feature vector Inline graphic for each video clip Inline graphic.
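The fixed-frame sampling can be sketched as below, assuming uniform sampling over the clip (consistent with the 16-frame sampling stated in the implementation details):

```python
import numpy as np

def sample_fixed_frames(clip, n_frames=16):
    """Uniformly sample a fixed number of frames from a variable-length clip,
    matching the fixed temporal window expected by the pre-trained backbone."""
    idx = np.linspace(0, len(clip) - 1, n_frames).round().astype(int)
    return clip[idx]
```

Uniform index spacing preserves the first and last frames of the sub-action while covering its interior evenly.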

Self-supervised learning for robust semantic-aware sub-action representation

To address the limitations of conventional feature extractors, a self-supervised learning module is introduced to enhance the discriminative power of the features. This module operates on the premise that a robust representation of a sub-action should be invariant to minor temporal perturbations. A strategic temporal masking mechanism is devised to train the model to learn the underlying essence of each sub-action without relying on external annotations. The process is formulated as:

graphic file with name d33e896.gif 4

where Inline graphic is the self-supervised network, and Inline graphic is a temporally masked version of the original video clip Inline graphic.

This approach simulates realistic temporal occlusions by selectively masking a contiguous subset of frames within each video clip, thereby generating challenging query samples. For each original video clip Inline graphic, a binary temporal mask Inline graphic is defined (Equation 5), where Inline graphic indicates a masked frame and Inline graphic denotes a retained frame. The removal of consecutive frames, rather than random or isolated frames, is intentional; it introduces a temporal ambiguity that provides a meaningful learning objective for the model. The starting index Inline graphic of the masked segment is randomly selected from Inline graphic, and the mask is defined as:

graphic file with name d33e943.gif 5

Specifically, let Inline graphic denote the ordered set of time indices where frames are preserved. The masked clip Inline graphic is then constructed as the subsequence of frames from Inline graphic corresponding to these indices, formally defined as:

graphic file with name d33e961.gif 6

where Inline graphic is the j-th element in the sorted set Inline graphic. This approach systematically removes information from d consecutive frames, forcing the encoder Inline graphic to learn features that are robust to such perturbations. This masking process is repeated across different temporal regions of the video clip, generating multiple incomplete variations for each Inline graphic, thereby augmenting the training data for the self-supervised network. To accommodate the fixed-length input requirement of the pre-trained backbone, both the original and masked video clips are subject to a fixed-frame sampling procedure. This ensures that the feature extractor receives consistent input dimensions while mitigating the risk of information loss.
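The masking procedure of Equations 5 and 6 can be sketched as follows; the function name and defaults are illustrative:

```python
import numpy as np

def mask_clip(clip, d, start=None, rng=None):
    """Remove d consecutive frames from `clip` (cf. Eqs. 5-6).

    Returns the binary mask (0 = masked, 1 = retained) and the masked
    sub-sequence built from the ordered retained frame indices.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    L = len(clip)
    if start is None:
        start = rng.integers(0, L - d + 1)  # random start index of the masked run
    mask = np.ones(L, dtype=int)
    mask[start:start + d] = 0
    keep = np.flatnonzero(mask)             # ordered set of preserved indices
    return mask, clip[keep]
```

Calling the function repeatedly with different random start indices yields the multiple incomplete variations per clip described above.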

The core training objective is to ensure that the feature representation of the masked video clip remains consistent with that of the original unmasked video clip. The pre-trained P3D backbone processes both the original video clip Inline graphic and the masked video clip Inline graphic to extract feature vectors Inline graphic and Inline graphic, respectively. A self-supervised module is then trained to predict the complete feature vector Inline graphic from the masked feature vector Inline graphic. Minimizing the discrepancy between these two representations enhances the encoder’s temporal robustness, yielding feature vectors that remain invariant to missing temporal segments. The learning objective is therefore to minimize the distance between the two representations using a semantic-aware loss, Inline graphic:

graphic file with name d33e1020.gif 7

This loss ensures that the model learns features that are robust to temporal disruptions while preserving the sub-action clip’s semantic content. The training of this module is depicted in Fig. 3. The output of this phase is an improved set of spatio-temporal features, Inline graphic, where each Inline graphic represents an enriched, robust sub-action representation.
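As a hedged sketch of the consistency objective in Equation 7, the snippet below assumes an MSE-style distance and trains a hypothetical linear SSL head by gradient descent on a single sample; the actual module is a deep network trained in PyTorch, and the exact distance function may differ:

```python
import numpy as np

def semantic_consistency_loss(f_full, f_pred):
    """Distance between the feature predicted from the masked clip and the
    feature of the full clip (cf. Eq. 7); MSE is assumed here."""
    return float(np.mean((f_pred - f_full) ** 2))

# One-sample training sketch with a hypothetical linear SSL head g(f) = f @ W.
K = 8
rng = np.random.default_rng(0)
W = np.eye(K) + 0.1 * rng.standard_normal((K, K))
f_full = rng.standard_normal(K)                     # feature of the unmasked clip
f_masked = f_full + 0.05 * rng.standard_normal(K)   # feature of the masked clip

loss_before = semantic_consistency_loss(f_full, f_masked @ W)
for _ in range(200):
    pred = f_masked @ W
    W -= 0.1 * (2.0 / K) * np.outer(f_masked, pred - f_full)  # gradient of the MSE
loss_after = semantic_consistency_loss(f_full, f_masked @ W)
```

After training, the head maps the masked-clip feature close to the full-clip feature, which is the invariance property the loss is designed to induce.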

Fig. 3.

Fig. 3

Overview of the robust semantic-aware sub-action representation model using self-supervised learning.

Score regression

Following the extraction of robust semantic-aware clip-level features, the final scores are predicted through a multi-substage process that leverages the overall score labels to progressively refine pseudo-subscores. This approach addresses the lack of ground-truth sub-action scores and captures crucial sequential dependencies. Specifically, a separate sub-network is designed for each sub-action to ensure tailored score predictions.

Pseudo-subscore calculation

The available datasets primarily provide overall score labels without sub-action annotations. Moreover, the manual annotation of sub-action scores requires specialized expertise, which is challenging to obtain. In the absence of fine-grained sub-action labels, pseudo-subscore values are generated in the initial stage of the proposed score regression framework to guide the model. Following robust spatio-temporal feature extraction, the input sub-action clips are converted into a feature set Inline graphic, where Inline graphic, K represents feature dimensions, and m denotes the sub-action index. In this phase, feature augmentation is performed by constructing new features (Inline graphic) through a weighted combination of robust spatio-temporal features (Inline graphic) and scoring features (Inline graphic). The robust spatio-temporal features, which capture the dynamic and structural characteristics of the input, are normalized such that their weights sum to one. The scoring features provide contextual evaluation information, with the overall score normalized to lie between zero and one. Mathematically, the augmented feature vector for each sub-action is represented as follows:

graphic file with name d33e1085.gif 8

Similar to the PSL method26, in this phase, the overall score label is embedded into the feature vector as a scoring feature (Equation 9):

graphic file with name d33e1098.gif 9

Here, Inline graphic is a weighting factor that balances the contribution of the overall score and the original features, and Inline graphic denotes the operation of concatenating features. This operation produces a new feature set Inline graphic, where Inline graphic. The augmented features are then fed into a five-layer fully connected network (FCN) to predict an initial pseudo-subscore, denoted as Inline graphic. The initial layer of this network takes the label-augmented features as input, and the number of nodes decreases progressively in each layer until the output layer has a single node. The final layer applies a Sigmoid activation function to produce the predicted pseudo-subscore. The pseudo-subscore calculation process can be described as follows:

graphic file with name d33e1124.gif 10

Here, Inline graphic denotes the fully connected operation, Inline graphic is the output of the last layer, Inline graphic represents the parameters for the t-th layer, and Inline graphic denotes the predicted pseudo-subscore for the m-th sub-action. This pseudo-subscore generator is replicated across all sub-actions to independently estimate the pseudo-subscores. Once the pseudo-subscores for all sub-actions are generated, they are fed into a fully connected network to predict the overall score for the video. This process is mathematically expressed as:

graphic file with name d33e1152.gif 11

In this equation, Inline graphic represents the predicted overall score for the input video, and Inline graphic refers to the weight parameters of the fully connected layer. As in standard supervised learning, training requires an objective function whose minimization guides the optimization of the model parameters. In this framework, the mean squared error (MSE) between the predicted score and the overall score label is used as the loss function.

graphic file with name d33e1165.gif 12

Here, Inline graphic is the number of training videos. Inline graphic is the predicted overall score for the i-th training sample, and Inline graphic is the corresponding overall score label. The trained model processes the training videos to extract pseudo-subscores for the sub-actions of each video, denoted as Inline graphic. This comprehensive approach ensures that the model leverages training data effectively to generate pseudo-subscores. The pseudo-subscore calculation algorithm is outlined in Algorithm 1 in the Appendix.
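The label augmentation and five-layer FCN of Equations 8-10 might be sketched as below; the layer widths, random weights, and the exact form of the weighted concatenation are assumptions for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_subscore(f, s_norm, alpha=0.9, seed=0):
    """Forward pass of one pseudo-subscore generator (cf. Eqs. 8-10).

    f      : robust sub-action feature vector (normalized to sum to one)
    s_norm : overall score label, min-max normalized to [0, 1]
    alpha  : hypothetical weighting factor between features and score
    The augmented vector concatenates the weighted feature and scoring
    parts; the layer widths and random weights are purely illustrative.
    """
    z = np.concatenate([alpha * f, [(1.0 - alpha) * s_norm]])
    rng = np.random.default_rng(seed)
    dims = [len(z), 512, 128, 32, 8, 1]  # node counts decrease to one output
    h = z
    for i in range(5):
        W = 0.05 * rng.standard_normal((dims[i], dims[i + 1]))
        h = np.maximum(h @ W, 0.0) if i < 4 else sigmoid(h @ W)  # Sigmoid last
    return float(h[0])
```

The Sigmoid output keeps each pseudo-subscore in (0, 1), matching the normalized score convention; in the actual framework one such network is trained per sub-action.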

Sequential feature augmentation with pseudo-subscores

Incorporating the overall score label as a feature for all sub-actions introduces an inherent bias in the learning process, causing sub-action predictions to align excessively with the overall score. This reduces their diversity and limits their ability to capture localized variations within the activity. To address this, it is essential to consider the dependencies between sub-actions. Specifically, this section examines the sequential dependencies among the substages within an activity and investigates the nature of the relationships between sub-actions. Understanding these dependencies is crucial for designing an effective architecture that accurately estimates the action’s quality score. To this end, two distinct approaches are proposed and explored, as detailed below.

Incorporating the immediately preceding sub-score   The temporal relationships between consecutive sub-actions are crucial in AQA, as the performance of each sub-action directly influences the next. For example, in diving, a minor error in the take-off posture can disrupt rotations or somersaults, affecting overall performance. Similarly, improper form or insufficient speed during a somersault can lead to a poor spin, impacting the dive’s entry. The “Entry” itself is one of the most critical components of a dive. A well-executed entry with a flat body position minimizes splash, reflecting precision and control. Conversely, an inability to maintain proper body position during entry often results in a larger splash, detracting from the overall quality of the dive. This underscores the importance of understanding how each preceding sub-action impacts the current one for accurate performance evaluation.

After obtaining pseudo-subscores for all sub-actions, these scores are utilized sequentially as features to enhance the model further. This enhancement process integrates richer contextual information into the feature representations, thereby improving the model’s ability to compute more precise and accurate pseudo-subscores for each sub-action. The second stage of the proposed score regression framework is designed to generate refined pseudo-subscore values for each sub-action (Fig. 4). In this phase, the overall score label S and the sub-score of the previous sub-action are treated as additional features. Similar to the previous section, feature augmentation (Inline graphic) is performed through a weighted combination of robust spatio-temporal features (Inline graphic) and scoring features (Inline graphic). However, in this case, the scoring feature vector is updated to include both the overall score label and the sub-score from the immediately preceding sub-action, providing contextual evaluation information. Specifically, in this phase, the scoring feature vector in Equation 9 is computed as follows:

graphic file with name d33e1228.gif 13

The subsequent steps for training the model, including pseudo-subscore generation and overall score prediction, follow the methodology outlined in (Pseudo-subscore Calculation) subsection. After training the model, the updated pseudo-subscores for the various sub-actions of each training video, Inline graphic, are extracted from the trained model. The entire process is summarized in Algorithm 2 in the Appendix.

Fig. 4.

Fig. 4

Illustration of incorporating the immediately preceding sub-score to generate refined pseudo-subscores.

Incorporating all preceding sub-scores   Temporal dependencies in actions extend beyond consecutive sub-actions, as the quality of one can be influenced by multiple preceding ones. For instance, a series of well-executed somersaults may still result in a poor entry if a diver takes off too early or too late, or fails to take off with sufficient force, with these errors cascading through subsequent actions. Conversely, a minor error in the take-off posture may amplify across multiple sub-actions, culminating in a larger splash at the end. This underscores the importance of capturing and modeling the interdependencies among all preceding sub-actions to gain a deeper understanding of the dynamics of action execution.

As illustrated in Fig. 5, following feature extraction, for the first sub-action, the overall score label is integrated into the feature set of the first sub-action, resulting in a new feature representation with dimensions (Inline graphic). This integration allows for the generation of enhanced features that capture additional context provided by the overall score. Subsequently, for the second sub-action, both the overall score label and the pseudo-subscore derived from the first substage are included as features. This produces a further enhanced feature set for the second sub-action, resulting in dimensions of (Inline graphic). This process continues, with each subsequent sub-action’s features augmented by the overall score label and all preceding pseudo-subscores, leading to progressively larger feature vectors.

Fig. 5.

Fig. 5

Illustration of incorporating all preceding sub-scores to generate refined pseudo-subscores.

In this case, the scoring features are derived from the overall score label and all preceding pseudo-subscores, from the first up to the current one. Thus, the scoring feature vector in Equation 9 is computed as follows:

graphic file with name d33e1273.gif 14

The model training process follows the exact same procedure as described in (Pseudo-subscore Calculation) subsection. During the training phase, the model iteratively updates its parameters to learn the mapping from the enhanced feature vectors to the corresponding pseudo-subscores. This iterative enhancement ensures that the model captures the cumulative information from all previous stages, thereby refining its predictions for subsequent substages. Once the model is trained, the refined pseudo-subscores for each training video, denoted as Inline graphic, are obtained from the trained model. The details of this phase are presented in Algorithm 3 in the Appendix.

Embedding pseudo-subscores and the overall score label into sub-action features enriches their representation with predictive information from prior sub-actions. This approach leverages interdependencies to enhance action quality prediction. At each sub-action, features are combined with the overall score label and pseudo-subscores to ensure continuous information flow across the sub-actions.
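The two scoring-feature constructions (cf. Equations 13 and 14) can be sketched as follows; the function names and the [0, 1] score convention are illustrative:

```python
import numpy as np

def scoring_features_sa1(s_norm, subscores, m):
    """SA1 (cf. Eq. 13): the normalized overall score plus the pseudo-subscore
    of the immediately preceding sub-action (none for the first sub-action)."""
    if m == 0:
        return np.array([s_norm])
    return np.array([s_norm, subscores[m - 1]])

def scoring_features_sa2(s_norm, subscores, m):
    """SA2 (cf. Eq. 14): the normalized overall score plus all preceding
    pseudo-subscores, so the vector grows with the sub-action index m."""
    return np.array([s_norm] + list(subscores[:m]))
```

SA1 keeps the scoring vector at a fixed size, whereas SA2 produces progressively larger vectors, reflecting short-range versus cumulative long-range dependencies.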

Multi-substage AQA regression

Finally, after accounting for the dependencies between sub-actions and calculating refined pseudo-subscores for each sub-action, the aim is to predict the final score with higher accuracy than existing methods. Since the overall score label is unavailable at test time in an AQA problem and must itself be estimated, it must be excluded from the feature vector, and a post-training process is performed using feature vectors without the overall score label. Therefore, the robust spatio-temporal features Inline graphic are fed into a fully connected network consisting of five layers. The initial layer of this network takes the K-dimensional feature vector as input, and the number of nodes decreases in each layer, gradually dropping to one. The computation procedure for this model is described in Equation 15:

graphic file with name d33e1298.gif 15

In this context, Inline graphic is the sub-score predicted for each sub-action. Following the procedure outlined in the preceding section, refined pseudo-subscore labels are obtained for each training video.

The total loss function (Inline graphic) for this final training phase is defined as a combination of the sub-score loss (Inline graphic) for each sub-action and the overall score loss (Inline graphic). Specifically, the former is computed as the MSE between the predicted sub-scores and the refined pseudo-subscores, while the latter is computed as the MSE between the predicted overall score and the ground-truth overall score label. This two-part loss function ensures that both the fine-grained and overall quality predictions are accurate. This process is detailed in Fig. 6.

graphic file with name d33e1324.gif 16

After this post-training phase, the test videos are fed into the trained model, which predicts the sub-scores for the various sub-actions of each video as well as the overall score. This comprehensive multi-stage regression ensures robust and accurate predictions by leveraging both localized quality cues and sequential dependencies.
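A minimal sketch of the two-part objective described above (cf. Equation 16); the balancing weight `lam` is an assumption, as the paper may simply sum the two MSE terms:

```python
import numpy as np

def total_loss(pred_subs, pseudo_subs, pred_overall, gt_overall, lam=1.0):
    """L_total: MSE of predicted sub-scores against the refined
    pseudo-subscores, plus MSE of the predicted overall score against
    the ground-truth overall score label."""
    l_sub = float(np.mean((np.asarray(pred_subs) - np.asarray(pseudo_subs)) ** 2))
    l_overall = (pred_overall - gt_overall) ** 2
    return l_sub + lam * l_overall
```

Training against both terms ties the fine-grained sub-score heads and the overall-score head together, so neither objective is optimized at the other's expense.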

Fig. 6.

Fig. 6

Illustration of the multi-substage AQA regression.

Experiments

Datasets and evaluation metrics

Datasets: To comprehensively evaluate the proposed framework, experiments were conducted using two publicly available fine-grained AQA datasets, the UNLV-Diving dataset1 and the FineDiving dataset2. The selection of these datasets is motivated by their detailed temporal annotations.

The UNLV-Diving dataset1, an extension of the MIT-Diving dataset3, comprises 370 videos of single-person dives. Sourced from the semifinals and finals of the men’s 10-meter platform event at the 2012 London Summer Olympics, these videos were obtained from YouTube, each with a duration of approximately 4 seconds. The dataset provides annotations for an overall score and a difficulty level, which facilitate the calculation of the execution score. It also includes sub-action segmentation labels, which were provided by Xiang et al.22. Importantly, while the dataset includes segmentation labels for each sub-action, it lacks explicit quality score labels for individual sub-actions. The difficulty levels range from 2.7 to 4.1, and the overall scores span from 21.6 to 102.6. The video frames maintain a consistent resolution of Inline graphic pixels. For standardized evaluation, the dataset is partitioned into a training set of 300 videos and a testing set of 70 videos.

The FineDiving dataset2 is a recently proposed large-scale fine-grained benchmark that is crucial to this study. It consists of 3000 diving videos sourced from 30 different international competitions, including the Olympics, World Cup, World Championships, and European Aquatics Championships. The videos, each with an average duration of 4.2 seconds, were collected from YouTube and contain full-length records and slow-motion replays from diverse viewpoints. A key characteristic of FineDiving is its rich, multi-level annotation scheme. The dataset includes labels for 52 action types, 29 sub-action types, and 23 difficulty degree types, along with detailed semantic and temporal structures. This hierarchical granularity makes it an ideal testbed for evaluating models that perform procedure-aware action quality assessment. However, similar to other publicly available benchmarks, this dataset does not include explicit quality scores at the sub-action level. For standardized evaluation, the dataset is partitioned into a training set of 2251 videos and a testing set of 749 videos.

Evaluation metrics: The performance of the proposed framework is evaluated using a comprehensive set of metrics that assess both the ranking and the absolute error of the predicted scores. Following standard practice in AQA, three primary metrics are employed: Spearman’s Rank Correlation (SRC), Mean Squared Error (MSE), and Mean Euclidean Distance (MED).

Spearman’s rank correlation. Spearman’s rank correlation coefficient (SRC) is a non-parametric measure that quantifies the strength and direction of the monotonic relationship between two variables, with values ranging from -1 to 1. In AQA, SRC is the most common primary evaluation criterion as it assesses the agreement between the predicted rankings and the ground-truth rankings of athletes. This metric is defined as Inline graphic:

graphic file with name d33e1395.gif 17

where p and q are the ranking sequences of the predicted and ground-truth scores of the test video set, respectively. Inline graphic is the covariance of these ranking sequences, a measure of how the two rankings vary together, while Inline graphic and Inline graphic are the standard deviations of p and q. A larger Inline graphic value indicates a stronger alignment between the predicted and actual score rankings, reflecting higher prediction accuracy.
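Equation 17 can be computed directly from the rank sequences; the sketch below ignores tie handling, which a production implementation (e.g., `scipy.stats.spearmanr`) would include:

```python
import numpy as np

def spearman_src(pred, gt):
    """Spearman's rank correlation (cf. Eq. 17): the Pearson correlation of
    the two rank sequences (no tie correction in this sketch)."""
    p = np.argsort(np.argsort(pred)).astype(float)  # ranks of predicted scores
    q = np.argsort(np.argsort(gt)).astype(float)    # ranks of ground-truth scores
    cov = np.mean((p - p.mean()) * (q - q.mean()))
    return float(cov / (p.std() * q.std()))
```

Perfectly concordant rankings yield 1, perfectly reversed rankings yield -1, regardless of the absolute score values.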

Mean squared error and mean Euclidean distance. While SRC assesses rank-ordering, additional metrics are necessary to quantify the absolute error of the predicted scores. MSE and MED are utilized for this purpose. MSE measures the average squared difference between predicted and ground-truth scores, defined as:

graphic file with name d33e1433.gif 18

Similarly, MED measures the average absolute difference between predicted and ground-truth scores:

graphic file with name d33e1438.gif 19

where Inline graphic denotes the predicted score, Inline graphic is the ground-truth score, and N is the number of samples. Smaller values for both MSE and MED indicate a closer correspondence between the predicted and ground-truth scores.
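Equations 18 and 19 translate directly into code:

```python
import numpy as np

def mse(pred, gt):
    """Mean squared error between predicted and ground-truth scores (cf. Eq. 18)."""
    return float(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2))

def med(pred, gt):
    """Mean Euclidean (absolute) distance between predicted and ground-truth
    scores (cf. Eq. 19)."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))
```

Unlike SRC, both metrics are sensitive to the absolute scale of the predictions, which is why they complement the rank-based evaluation.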

Implementation details

The proposed framework is implemented using the PyTorch toolbox97 on a single NVIDIA RTX 4090 GPU. For the unsupervised sub-action segmentation, the TW-FINCH framework96 is employed to partition each video into distinct sub-actions. The segmentation strategy is tailored to each dataset’s characteristics. For the UNLV-Diving dataset, videos are segmented into five specific sub-actions: Start, Take-off, Flight, Entry, and End. The FineDiving dataset, while featuring varying sub-action counts, is primarily composed of dives with three distinct steps (approximately 60%), four steps (approximately 30%), and five steps (approximately 10%). The number of distinct sub-actions within the “Flight” stage can vary based on the specific dive number, leading to a varying total number of sub-actions per video. Consistent with the original authors’ approach2 and to establish a unified procedure-aware framework, a fixed three-step segmentation is adopted for each dive, representing the fundamental phases of Take-off, Flight, and Entry.

For initial feature extraction, a Kinetics-pretrained P3D network11 serves as the backbone, producing 2048-dimensional features. During both training and testing, 16 frames are uniformly sampled from each sub-action clip and resized to Inline graphic pixels. To compute robust features, the self-supervised learning module employs a temporal masking mechanism where d, the number of consecutive frames to mask, is empirically set to 25% of a sub-action’s duration Inline graphic. The overall score for diving performance is computed as the product of the execution score and the action’s difficulty level. To maintain consistency, overall scores are normalized using min–max normalization, while execution scores, ranging from 0 to 30, are normalized by dividing by 30. The training process is conducted using the Adam optimizer98, with an initial learning rate of 0.0001, which is reduced by a factor of 0.1 every 30 epochs. To mitigate overfitting, L2 regularization with a weight decay of 0.0005 and a dropout probability of 0.5 are applied. A combination of fully connected networks, one for each sub-action, is employed for the score regression. The study also examined different network configurations, finding that a five-layer architecture provided the best balance between accuracy and computational cost. All final results are reported as the average of 10 experimental runs to ensure stability and reproducibility.
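The score normalization described above can be sketched as follows; the min/max values in the usage example come from the UNLV-Diving score range (21.6 to 102.6) reported earlier:

```python
def normalize_scores(overall, train_overall, execution):
    """Min-max normalize the overall score using training-set statistics, and
    divide the execution score (range 0-30) by 30."""
    lo, hi = min(train_overall), max(train_overall)
    return (overall - lo) / (hi - lo), execution / 30.0
```

Using training-set statistics for the min-max bounds avoids leaking test-set information into the normalization.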

Ablation study

A comprehensive ablation study was performed to rigorously examine the contribution of each major component within the proposed framework. The analysis was structured to isolate and quantify the performance improvements arising from the two central innovations: the self-supervised learning module and the progressive pseudo-subscore learning strategies. To assess the effectiveness of each module, a series of experimental models were trained and evaluated on both the UNLV-Diving1 and FineDiving2 datasets. A brief description of each model configuration is provided below:

  • Baseline Model: The baseline is defined as a conventional single-stage AQA framework that excludes both the self-supervised learning module and the progressive pseudo-subscore learning strategy. This model is trained solely with the overall score loss (Inline graphic) specified in Equation (16), thereby establishing a fundamental benchmark for performance comparison.

  • PSL Model: The PSL framework26 is adopted as a comparative baseline. PSL leverages overall scores as both labels and input features to generate pseudo-subscores for refined multi-substage AQA predictions.

  • Baseline + SA1: This model incorporates the first sequential augmentation strategy, where features are enriched with the pseudo-subscore of the immediately preceding sub-action, to test the efficacy of modeling short-range temporal causality without the proposed self-supervised learning module for robust feature representation.

  • Baseline + SA2: This model extends the previous approach by considering the cumulative impact of all preceding pseudo-subscores, thereby capturing long-range dependencies.

  • Baseline + SSL + SA1: The framework is further enhanced by integrating the proposed self-supervised learning module for feature refinement, combined with the first sequential augmentation strategy.

  • Baseline + SSL + SA2: This represents the full proposed framework, combining the robust features from the self-supervised learning module with the comprehensive long-range temporal dependency modeling.

  • Baseline + SSL + SA2 + GT: This variant corresponds to the full proposed framework in which the unsupervised temporal semantic segmentation is replaced with ground-truth sub-action boundaries. It serves as an upper-bound reference to assess the sensitivity of the framework to sub-action segmentation accuracy, while retaining the self-supervised feature refinement and long-range temporal dependency modeling components.

As reported in Table 1 and Table 2, the results demonstrate a consistent and incremental enhancement in predictive accuracy as individual components are integrated into the baseline configuration. For the UNLV-Diving dataset, the Baseline model, with an SRC of 0.8700 and high values for MSE and MED of 85.2400 and 5.6600, respectively, confirms the difficulty of the task. The PSL model showed a slight improvement in SRC over the Baseline, obtaining a reduction of 46.5648 in MSE and 0.8582 in MED. This validates that using a semantically segmented approach and generating pseudo-subscores improves the model’s overall prediction accuracy. The introduction of the sequential augmentation strategies (SA1 and SA2) leads to a performance increase, with SRC improving to 0.8798 and 0.8966, and MSE and MED values also decreasing. This validates the importance of modeling temporal dependencies between sub-actions. The most substantial gains are observed with the addition of the self-supervised learning module. A comparison between the Baseline + SA1 model (SRC: 0.8798) and the Baseline + SSL + SA1 model (SRC: 0.9553) shows a significant performance jump, with a substantial reduction in MSE and MED values, indicating that the feature robustness provided by the self-supervised approach is a key driver of accuracy. The full proposed model, Baseline + SSL + SA2, achieves the best performance with a state-of-the-art SRC of 0.9651 and a notably low MSE of 10.3361 and MED of 2.3222, proving the value of combining both robust feature learning and comprehensive temporal dependency modeling.

Table 1.

Ablation studies on UNLV-Diving. The table presents a systematic comparison to quantify the impact of each module on the framework’s performance. The best results for each metric are indicated in bold. (SRC Inline graphic): higher values are better; (MSE and MED Inline graphic): lower values are better.

Method SRC MSE MED
Baseline Model 0.8700 85.2400 5.6600
PSL Model 0.8713 38.6752 4.8018
Baseline + SA1 0.8798 33.9632 4.6138
Baseline + SA2 0.8966 32.4197 4.1447
Baseline + SSL + SA1 0.9553 18.2321 3.0547
Baseline + SSL + SA2 0.9651 10.3361 2.3222
Baseline + SSL + SA2 + GT 0.9663 10.3055 2.3118

Table 2.

Ablation studies on FineDiving. The table presents a systematic comparison to quantify the impact of each module on the framework’s performance. The best results for each metric are indicated in bold. (SRC Inline graphic): higher values are better; (MSE and MED Inline graphic): lower values are better.

Method SRC MSE MED
Baseline Model 0.8576 58.8500 4.7571
PSL Model 0.9176 33.6161 4.2422
Baseline + SA1 0.9311 29.5817 3.8799
Baseline + SA2 0.9393 25.8509 3.6397
Baseline + SSL + SA1 0.9733 16.1701 3.0931
Baseline + SSL + SA2 0.9835 11.9835 2.4370
Baseline + SSL + SA2 + GT 0.9839 11.9720 2.4148

The efficacy of the proposed framework is further reinforced on the more extensive FineDiving dataset, where a similar trend is observed. The baseline performance starts with an SRC of 0.8576, which is consistently improved by each subsequent module. The PSL model achieved an improvement of 0.06 in SRC over the Baseline, obtaining a reduction of 25.2339 in MSE and 0.5149 in MED. The Baseline + SA1 and Baseline + SA2 models again show that modeling temporal dependencies is a crucial step for performance enhancement. The addition of the self-supervised learning module to the Baseline + SA1 model results in a jump in SRC from 0.9311 to 0.9733, with a reduction of 13.4116 in MSE and 0.7868 in MED, confirming the efficacy of the feature refinement strategy on a larger and more diverse dataset. The full proposed model achieves the best performance with an SRC of 0.9835, an MSE of 11.9835, and a MED of 2.4370, demonstrating its robustness and effectiveness across both datasets.

Reliability of unsupervised sub-action segmentation

The reliability of the unsupervised temporal semantic segmentation module constitutes a critical component of the proposed framework, serving as the structural foundation for the subsequent pseudo-subscore learning. The following analysis evaluates the fidelity of the TW-FINCH96 module through both qualitative and quantitative lenses, while elucidating the mechanisms that ensure framework robustness against potential boundary misalignments.

1. Segmentation accuracy:   Figure 7 provides a qualitative visualization of the temporal semantic segmentation results compared against Ground-Truth (GT) annotations. As depicted in Fig. 7 (a) for the UNLV-Diving dataset and Fig. 7 (b) for FineDiving, the unsupervised temporal boundaries exhibit a reasonable alignment with the semantic transitions of the action sequences (e.g., the progression from “Take-off” to “Flight”). Although temporal deviations are discernible—primarily at rapid transition points—the segmentation effectively preserves the coarse-to-fine temporal structure essential for downstream assessment. To quantitatively assess segmentation quality, the TW-FINCH algorithm is evaluated using the Average Intersection Over Union (AIoU@0.5) metric on both benchmark datasets. The module achieves an AIoU@0.5 of 74.7276% on UNLV-Diving and 80.1520% on FineDiving. Such metrics confirm that the unsupervised clustering successfully groups temporally adjacent frames with coherent semantic content, yielding a structural prior that closely mirrors manual annotations.

Fig. 7.

Fig. 7

Visualization of sub-action segmentation reliability on (a) the UNLV-Diving dataset (5 sub-actions) and (b) the FineDiving dataset (3 sub-actions). Color-coded bars denote the temporal duration of each sub-action phase. The overlap between Ground-Truth (top bars) and predicted (bottom bars) intervals demonstrates the performance of the unsupervised sub-action segmentation module in capturing semantic boundaries. Quantitatively, the sub-action segmentation module achieves a promising alignment with human annotations, yielding AIoU@0.5 scores of 74.7276% and 80.1520% for (a) and (b), respectively.

2. Robustness via self-supervised learning:   Despite promising segmentation accuracy, minor boundary jitters remain inevitable in unsupervised settings. Importantly, however, the proposed framework is explicitly designed to tolerate such inaccuracies. This resilience stems primarily from the Self-Supervised Learning for Robust Semantic-Aware Sub-Action Representation module. Specifically, during self-supervised training, contiguous sequences of frames are randomly masked, and the model is optimized to reconstruct sub-action representations consistent with the unmasked context. This training paradigm encourages the encoder to infer semantic information from surrounding temporal cues, thereby reducing sensitivity to missing or misaligned frames. Consequently, even if the segmentation module misassigns a few frames at the boundary (effectively creating “noisy” input), the feature extractor—trained to handle masked sequences—maintains a stable and robust representation. Small boundary jitters effectively manifest as localized temporal noise, to which the representation learning process is inherently insensitive. Furthermore, the Progressive Pseudo-subscore Learning strategy reinforces temporal stability by conditioning each sub-action representation on the accumulated pseudo-subscores of preceding sub-actions. This strategy distributes structural information across the entire sequence, minimizing the impact of localized misalignment at stage boundaries while preserving both short-range continuity and long-range temporal coherence.
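The masking scheme described above can be sketched in simplified form. The snippet illustrates contiguous frame masking and a reconstruction loss restricted to masked positions; the actual module operates on learned spatio-temporal features, and all names here are hypothetical:

```python
import random

def mask_contiguous(features, span):
    # features: list of per-frame feature vectors.
    # Zeroes out a random contiguous run of `span` frames in a copy,
    # returning the masked copy plus the masked indices.
    n = len(features)
    start = random.randrange(0, n - span + 1)
    idx = list(range(start, start + span))
    masked = [f[:] for f in features]
    zero = [0.0] * len(features[0])
    for i in idx:
        masked[i] = zero[:]
    return masked, idx

def reconstruction_loss(pred, target, idx):
    # Mean squared error computed only over the masked positions,
    # so the encoder is rewarded for inferring them from context.
    total, count = 0.0, 0
    for i in idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

Minimizing this loss over many random masks pushes the encoder to fill in missing frames from surrounding temporal cues, which is precisely the property that absorbs boundary jitter.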

3. Impact on downstream assessment performance:   To empirically quantify the propagation of segmentation errors to the final quality assessment, an upper-bound analysis was conducted by replacing the unsupervised TW-FINCH sub-action segmentation with GT temporal annotations. As detailed in the ablation study (refer to the “Baseline + SSL + SA2 + GT” entry in Table 1 and Table 2), the introduction of perfect boundary information yields only negligible performance gains across all evaluation metrics. Specifically, on the UNLV-Diving dataset, the SRC increases marginally from 0.9651 to 0.9663. Similarly, the performance improvement on FineDiving is limited, with SRC rising from 0.9835 to 0.9839. This minimal performance gap substantiates that the overall assessment accuracy is not heavily contingent on precise boundary localization. Instead, segmentation errors are effectively absorbed and corrected by the self-supervised refinement and progressive pseudo-subscore learning modules, yielding results that remain stable under variations in sub-action length, boundary placement, and temporal alignment. This property is particularly advantageous for real-world AQA scenarios, where fine-grained temporal annotations are rarely available, and segmentation noise is unavoidable.

Comparison with the state-of-the-art methods

The performance of the proposed framework is quantitatively assessed and compared with existing AQA methods, trained and evaluated on the UNLV-Diving and FineDiving datasets. The results, summarized in Table 3 and Table 4, demonstrate the consistent advantages of the integrated self-supervised feature refinement and progressive pseudo-subscore learning mechanisms, thereby validating their contribution to performance improvement.

Table 3.

Comparison of the proposed framework’s performance with existing AQA methods on UNLV-Diving. The best results for each metric are indicated in bold, while the second-best results are underlined. A dash (—) indicates that the metric was not reported by the original authors. SRC (↑): higher values are better; MSE and MED (↓): lower values are better.

Method SRC MSE MED
C3D-SVR1 0.7400 - -
C3D+CNN18 0.8000 - 7.7800
S3D22 0.8600 97.4600 6.9000
ScoringNet24 0.8400 - 5.3600
C3D-AVG-STL50 0.8300 - -
MUSDL45 0.8738 129.5963 -
Metric Learning75 0.7600 105.6200 -
CoRe83 0.8589 7.8716 -
TAL29 0.8649 - -
MSRM25 0.8798 73.92 -
GDLT30 0.8735 78.9092 -
HGCN6 0.8871 101.4813 -
PSL26 0.8713 38.6752 4.8018
SSPR37 0.9257 - -
DAE46 0.8440 85.3052 -
T2CR38 0.8334 96.7862 -
CoFInAl95 0.8652 150.5001 -
Ours (Baseline + SSL + SA2) 0.9651 10.3361 2.3222

Table 4.

Comparison of the proposed framework’s performance with existing AQA methods on FineDiving. The best results for each metric are indicated in bold, while the second-best results are underlined. A dash (—) indicates that the metric was not reported by the original authors. SRC (↑): higher values are better; MSE and MED (↓): lower values are better.

Method SRC MSE MED
MUSDL45 0.8891 48.9616 -
CoRe83 0.9406 26.7377 -
GDLT30 0.9351 29.2547 -
UD-AQA47 0.9341 - -
TSA2 0.9203 - -
ASTRM34 0.9222 - -
HGCN6 0.9381 26.3895 -
MCoRe85 0.9232 - -
T2CR38 0.9275 27.2114 -
STSA81 0.9397 - -
FineParser80 0.9435 - -
DAE46 0.9356 27.1739 -
CoFInAl95 0.9317 36.4681 -
NS-AQA68 0.9610 - -
RICA249 0.9421 - -
Ours (Baseline + SSL + SA2) 0.9835 11.1787 2.4369

As shown in Table 3 for the UNLV-Diving dataset, the proposed framework achieves the highest SRC value of 0.9651, indicating a strong correlation between the predicted and ground-truth rankings. The framework also achieves notably low MSE and MED values of 10.3361 and 2.3222, respectively, confirming reduced prediction errors. While the CoRe method83 reports a lower MSE of 7.8716, the proposed framework’s SRC is 0.1062 higher, highlighting its overall effectiveness. Compared to the most similar existing method, PSL26, the proposed approach improves the SRC value by 0.0938 and reduces the MSE and MED values by 28.3391 and 2.4796, respectively, underscoring its superior accuracy in predicting both rankings and absolute scores.

Table 4 presents a comparison with state-of-the-art baselines on the more complex and extensive FineDiving dataset. The results demonstrate the framework’s superior performance. The proposed framework achieves the highest performance across all three metrics, setting new state-of-the-art scores for SRC, MSE, and MED of 0.9835, 11.1787, and 2.4370, respectively. This consistent outperformance is a strong indicator of the framework’s robustness, particularly given the dataset’s diverse video sources and the varying number of sub-actions per dive. The results on both datasets collectively validate the efficacy of combining unsupervised sub-action decomposition with a self-supervised learning strategy, confirming that robust feature representation and explicit temporal modeling are key to achieving highly accurate AQA.

Quantitative results for the UNLV-Diving dataset, presented in Fig. 8, further underscore the framework’s performance. The figure depicts outcomes for all 70 test videos when the final algorithm, which incorporates self-supervised learning and all preceding sub-scores, is applied. These results demonstrate high accuracy for most test samples, with only a few cases exhibiting notable deviations.

Fig. 8.

Fig. 8

Scoring results of the proposed framework on the UNLV-Diving dataset for all 70 test videos.

A detailed error analysis was conducted to examine the robustness of the proposed framework, particularly in addressing challenging cases and dataset imbalances that have historically limited the performance of prior methods such as MSRM25. The investigation, based on samples from the UNLV-Diving dataset, demonstrates that the framework achieves higher accuracy in evaluating both high-difficulty actions and low-scoring videos, while effectively mitigating the severe score overestimation observed in the earlier approach25. Figure 9 illustrates a direct comparison of predictions from two configurations of the proposed framework, Baseline + SSL + SA1 and Baseline + SSL + SA2, with MSRM25. The comparison is conducted across three representative challenging cases with an average difficulty level of 3.5, including reverse dives and armstand sequences.

Fig. 9.

Fig. 9

Samples with diverse and inaccurate scores from the UNLV-Diving dataset. Comparative results of predicted scores are shown for the MSRM method25 and two configurations of the proposed framework. The green value denotes the ground-truth score, the purple value indicates the MSRM prediction, the blue value corresponds to the proposed framework with short-range dependencies, and the red value represents the proposed framework with long-range dependencies.

The UNLV-Diving dataset poses a considerable challenge due to its inherent score imbalance, where overall scores range from 21.6 to 102.6 with an average near 78, and high-scoring samples are disproportionately represented. This imbalance has been shown to degrade the predictive accuracy of earlier models, particularly for low-scoring instances. For example, in MSRM25, Sample 235 with a ground-truth score of 21.60 was overestimated at 73.84, and Sample 93 with a ground-truth score of 37.40 was predicted at 59.66. In contrast, the proposed framework produced substantially closer estimates of 35.09 and 40.32, respectively. This improvement can be attributed to the self-supervised learning module, which extracts more robust and semantically meaningful representations, and the progressive feature augmentation strategy, which effectively captures sequential dependencies. By reducing systematic overestimation and maintaining stability under data imbalance, the proposed framework demonstrates enhanced robustness and reliability in action quality assessment.

Interpretability of fine-grained quality assessment for diving sub-actions

Experimental results indicate that the integration of sub-scores enhances the accuracy of overall score prediction while simultaneously providing fine-grained feedback at the sub-action level. To validate the meaningfulness and utility of this analysis, a qualitative examination was performed, linking numerical predictions to visual evidence. Figure 10 presents the predicted sub-scores for three representative videos from the UNLV-Diving dataset, each with a contrasting overall quality score. For this analysis, the full framework (Baseline + SSL + SA2) was selected to output the sub-scores. The visualization provides powerful stage-specific feedback, highlighting the model’s performance at each of the five sub-actions. An analysis of the proposed framework’s performance across different sub-actions reveals that the sub-scores for the first three stages of a dive are typically close in value across samples. In contrast, the sub-actions of “Entry” and “End” exhibit the greatest score diversity. This observation highlights the model’s high sensitivity to these two sub-actions, indicating they are key determinants of a dive’s final quality. For instance, a high-scoring sample (ground-truth: 93.60, predicted: 92.66) demonstrates a high sub-score of 0.71 for the “Entry” stage, which corresponds to a standard body position and a clean entry without a bend, a metric aligned with expert judging criteria. Similarly, the “End” stage, which evaluates the magnitude of the water splash, received a high predicted sub-score of 0.93. This high value objectively reflects a minimal splash, serving as an indicator of precise execution and proper entry posture. Conversely, a low-scoring sample (ground-truth: 42.90, predicted: 41.26) received notably lower predicted sub-scores for the “Entry” and “End” sub-actions (0.47 and 0.24, respectively), accurately reflecting a poor landing and a large, uncontrolled splash.

Fig. 10.

Fig. 10

Visualization of predicted sub-scores for three samples on the UNLV-Diving dataset with low, medium, and high overall scores. Ground-truth overall scores are shown in green, and predicted overall scores in blue.

The framework’s ability to provide fine-grained insights is further demonstrated on the FineDiving dataset. Within this dataset, the “Entry” sub-action displays substantially higher score variability than the “Take-off” and “Flight” sub-actions, confirming its pivotal role in accurately modeling performance differences and ensuring a reliable assessment of overall dive quality. Figure 11 provides a comparative analysis of a high-scoring sample and three low-scoring samples, with a specific focus on the critical “Entry” sub-action. For each sample, the figure presents keyframes from the third sub-action, capturing the critical moments of entry into the water and the resulting splash, along with the corresponding predicted sub-score, ground-truth overall score, and predicted overall score. The high-scoring sample (ground-truth: 100.80, predicted: 92.69) exemplifies the ideal execution of this stage, with keyframes revealing a perfectly vertical and straight body posture at the moment of entry. This perpendicular form, which aligns with expert judging criteria, is associated with a minimal splash and is correctly reflected by a high predicted sub-score of 0.8267. Conversely, the three low-scoring samples clearly demonstrate suboptimal entry forms, with keyframes consistently showing a significant body bend or a lack of verticality, leading to a large, uncontrolled splash. In each of these cases, the model correctly assigns a low predicted sub-score, demonstrating a strong correlation between its prediction and the visible performance flaws. This visual evidence provides crucial support for the framework’s core premise: it learns to identify the same critical performance metrics as human judges. 
The strong correspondence between a diver’s technical execution and the predicted sub-score confirms that the framework is not simply regressing an arbitrary number; instead, it provides meaningful, interpretable feedback that offers a powerful tool for coaches and athletes to identify the root cause of performance flaws and a clear roadmap for targeted skill improvement. This capability represents a critical advantage over methods that only provide a single score. This analysis validates that the proposed framework successfully overcomes the challenge of sparse fine-grained annotations by generating pseudo-subscores that are a direct reflection of visible performance quality.

Fig. 11.

Fig. 11

Visualization of predicted sub-scores for the last sub-action of selected FineDiving samples. Ground-truth overall scores are shown in green, predicted overall scores in blue, and the last value in each row indicates the predicted sub-score for the “Entry” sub-action.

To rigorously validate the interpretability of the generated pseudo-subscores and their alignment with expert judging criteria, a stratified distribution analysis was conducted using the FineDiving dataset. Theoretically, semantically meaningful pseudo-subscores must exhibit monotonic behavior relative to ground-truth quality; specifically, superior athletic performances should correlate with elevated sub-scores in critical sub-actions. Per FINA judging regulations1, the “Entry” sub-action constitutes the most heavily weighted component of a dive, often dictating the overall assessment. Consequently, the predicted sub-score associated with the “Entry” sub-action is expected to demonstrate the strongest correlation with the final ground-truth score, surpassing earlier sub-actions such as the “Take-off”. To facilitate visualization, test samples were stratified into three distinct tiers based on ground-truth overall scores: low-quality (below 50), medium-quality (50–80), and high-quality (above 80).

Figure 12 illustrates the distribution of pseudo-subscores for the three diving sub-actions: Take-off (Sub-action 1), Flight (Sub-action 2), and Entry (Sub-action 3). A monotonic increase in pseudo-subscores is evident across the quality cohorts for all sub-actions, providing robust quantitative evidence that the generated scores serve as reliable proxies for action quality. Corroborating FINA regulations that emphasize the dominance of the “Entry” sub-action, Fig. 12 (c) reveals that low-quality performances exhibit markedly lower pseudo-subscores in this stage compared to medium- and high-quality executions. The low-quality cluster is characterized by high variance and depressed median scores, effectively capturing the severe penalties associated with “splash” errors. Conversely, the high-quality cluster displays a compact distribution with elevated scores, consistent with the “rip entry” technique rewarded by human judges.
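The stratification underlying this analysis can be sketched as follows. The tier thresholds follow the figure (below 50, 50–80, above 80), while the helper names are ours:

```python
def stratify(scores, low=50.0, high=80.0):
    # Assign each ground-truth overall score to a quality tier.
    return ["low" if s < low else "high" if s > high else "medium"
            for s in scores]

def median_subscore_per_tier(subscores, tiers):
    # Group per-video sub-scores by tier and report the median of each group,
    # the statistic summarized in Fig. 12.
    groups = {}
    for s, t in zip(subscores, tiers):
        groups.setdefault(t, []).append(s)
    out = {}
    for t, vals in groups.items():
        vals.sort()
        m = len(vals) // 2
        out[t] = vals[m] if len(vals) % 2 else (vals[m - 1] + vals[m]) / 2
    return out
```

A monotonic ordering of the resulting medians (low < medium < high) for a given sub-action is the quantitative signature of semantic consistency discussed above.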

Fig. 12.

Fig. 12

Stratified distribution analysis of generated pseudo-subscores across three distinct quality clusters (Low < 50, Medium 50–80, High > 80) on the FineDiving dataset. The monotonic increase in median scores from low to high-quality tiers across all sub-actions—Take-off (a), Flight (b), and Entry (c)—quantitatively validates the semantic consistency of the unsupervised pseudo-subscores. Notably, the Entry sub-action (c) exhibits the highest variance in the low-quality cluster, effectively capturing the severe penalties associated with splash errors, whereas high-quality dives demonstrate a compact, high-value distribution indicative of consistent execution.

Collectively, the results quantitatively confirm that the proposed framework has successfully identified the “Entry” sub-action as the primary discriminative determinant for assessment, assigning high sensitivity to such critical stages while maintaining consistency in preparatory phases.

Generalizability and limitations

While the experimental evaluation in this study focuses on the UNLV-Diving and FineDiving datasets, this selection is driven by the rigorous benchmarking standards of the AQA community rather than inherent architectural constraints. A critical attribute of the proposed framework is its domain-agnostic design; none of the constituent modules—including the unsupervised sub-action segmentation, self-supervised learning for robust semantic-aware sub-action representation, pseudo-subscore calculation, or progressive pseudo-subscore learning—rely on diving-specific heuristics such as splash detection or pose-template constraints. Instead, the framework operates solely on generic spatio-temporal motion representations. As such, diving serves as a challenging evaluation domain characterized by high temporal complexity, rapid phase transitions, and subtle execution variations, rather than a restrictive application setting.

The procedural nature of the framework renders it broadly applicable to a wide range of multi-stage skill assessment scenarios beyond competitive sports. Potential extensions include medical training environments (e.g., surgical skill evaluation or endoscopic navigation), industrial process monitoring (e.g., compliance with standardized assembly procedures), and rehabilitation or physical therapy assessment, provided that the target activity exhibits an interpretable sequential structure. The progressive pseudo-subscore learning strategy is specifically designed to model cumulative execution quality across ordered sub-actions, enabling a natural transfer to other structured activities composed of semantically meaningful stages.

Nevertheless, a practical limitation of the proposed framework lies in its reliance on an overall performance score for supervision during training. While the framework explicitly eliminates the need for temporal boundary annotations through unsupervised sub-action segmentation, it still requires an overall score label to optimize the assessment model. This requirement reflects a common characteristic of AQA formulations and does not restrict inference-time deployment, but it does define the expected annotation availability for extending the approach to new domains.

In summary, the proposed method is well-suited for diverse action categories that involve structured, multi-stage execution and provide overall quality annotations. Within this setting, the framework offers a scalable and annotation-efficient solution for fine-grained quality assessment without imposing additional supervision requirements at the sub-action level.

Computational efficiency analysis

To assess the practical scalability of the proposed framework for real-world deployment, a comprehensive profiling of runtime performance and memory consumption is conducted relative to the PSL baseline26. The evaluation decomposes both pipelines into their constituent modules to isolate individual computational costs. All measurements are obtained using the UNLV-Diving dataset to ensure a consistent and standardized experimental setting. A detailed summary of parameter counts, training duration, inference latency, and memory usage for each module is reported in Table 5.

Table 5.

Comparative analysis of computational efficiency for the PSL baseline and the proposed framework. Cells marked with a dash (-) denote modules where a specific runtime or parameter metric does not apply, such as components relying on pre-trained backbones or unsupervised algorithms without a standard training epoch. Results indicate that the proposed framework performs nearly as well as the baseline in terms of runtime and memory, yielding comparable scalability. Note that the additional offline training overhead for feature extraction is justified by the framework’s capacity to leverage large-scale datasets for enhanced representation learning.

Module Method Train Time (Sec./Epoch) Inference (Sec./Video)* Params (M) Memory (MB)
PSL26
1. Sub-action Segmentation ED-TCN23 (Supervised) 5.956 0.064 10.488 39.978
2. Feature Extraction P3D - 0.038 66.492 253.639
3. Pseudo-subscore Calculation Latent Sub-score Generation 11.733 - 5.417 20.637
4. AQA Network Multi-substage AQA Regression 11.733 0.027 5.414 20.637
Proposed Framework
1. Sub-action Segmentation TW-FINCH96 (Unsupervised) - 0.025 - 3.679
2. Feature Extraction SSL-Refined P3D 44.850 0.061 14.401 54.931
3. Pseudo-subscore Calculation Latent Sub-score Generation 11.733 - 5.417 20.637
4. Robust Pseudo-subscore Calculation Sequential Feature Augmentation 11.733 - 5.420 20.675
5. AQA Network Multi-substage AQA Regression 11.733 0.027 5.414 20.637

*Inference metrics are reported as the mean time required to process a single video, measured on a test set with an average length of 104 frames.

Runtime Profile.   An analysis of the training phase reveals clear differences in computational characteristics between the proposed framework and the baseline. For sub-action segmentation, the framework adopts the unsupervised TW-FINCH algorithm96. As a non-parametric clustering approach, the module eliminates training costs entirely. In contrast, the supervised ED-TCN23 employed by PSL requires a dedicated training period. Although supervised segmentation can be substituted with unsupervised alternatives within the proposed framework, the analysis in the Reliability of Unsupervised Sub-action Segmentation subsection demonstrates that the self-supervised representation learning mechanism effectively compensates for potential segmentation inaccuracies. Consequently, the unsupervised formulation is preferable, offering substantial advantages in computational efficiency and memory usage while removing the dependency on labor-intensive temporal boundary annotations.
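The per-module timings reported above can be reproduced with simple wall-clock profiling; the sketch below is a generic illustration of the measurement protocol, not the instrumentation actually used:

```python
import time

def profile_module(fn, *args, repeats=5):
    # Measure the mean wall-clock time of one module call over several
    # repeats, smoothing out scheduler and cache noise.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

Training time per epoch and inference time per video are obtained the same way, differing only in whether `fn` wraps one optimization epoch or one forward pass.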

Regarding spatio-temporal feature extraction, although both frameworks utilize a P3D backbone, the proposed framework integrates a Self-Supervised Learning (SSL) mechanism to enhance representation robustness. Capitalizing on the label-free nature of the framework, the model leverages diverse large-scale video data by pre-training the SSL-Refined P3D architecture on a subset of 4,244 sequences from the Kinetics-400 dataset99, followed by fine-tuning on the target diving data. Such a process incurs a training overhead of 44.85 seconds per epoch. While the standard P3D backbone in PSL typically relies on fixed pre-trained weights—thereby bypassing specific training costs—the additional offline computational investment is justified by the generation of robust spatio-temporal representations that capture fundamental motion dynamics prior to domain-specific optimization.

In the subsequent stages, specifically the Pseudo-subscore Calculation, Robust Pseudo-subscore Calculation, and the final AQA Network, the training durations are almost identical for both the proposed framework and the baseline, averaging 11.733 seconds per epoch. This uniformity arises from the homologous architecture of these modules, which share nearly equivalent parameter counts (approximately 5.4 M), resulting in consistent computational throughput during the backward pass. In terms of deployment efficiency, the inference latency for processing novel video sequences remains consistently low. As summarized in Table 5, the proposed framework exhibits highly manageable runtime metrics, ensuring suitability for large-scale real-world datasets.

Memory Efficiency.   Memory consumption constitutes a critical determinant for deployment on resource-constrained hardware. Given that memory is allocated dynamically per module and released upon completion, the peak requirement is dictated by the most resource-intensive component rather than the aggregate of all stages. The PSL baseline faces significant constraints from the memory-heavy supervised ED-TCN segmentation (39.978 MB) and the standard P3D extraction (253.639 MB). In contrast, the proposed framework substantially mitigates these bottlenecks. The unsupervised segmentation module operates with a minimal footprint of approximately 3.679 MB, while the SSL-Refined P3D reduces the parameter count from 66.49 M to 14.40 M, lowering the feature extraction memory requirement to 54.931 MB. Remaining downstream modules exhibit comparable memory usage across both methods, indicating that the incorporated robustness mechanisms do not impose excessive memory demands.

Collectively, the reported metrics indicate that the proposed framework achieves a superior trade-off between computational resource utilization and assessment accuracy. By reducing inference time and peak memory usage relative to PSL, while simultaneously removing reliance on supervised temporal annotations and enhancing robustness to segmentation inaccuracies, the proposed design is well-positioned and scalable for practical AQA applications under realistic computational constraints. Notably, reported memory usage metrics correspond to per-batch consumption, assuming 32-bit floating-point precision. Furthermore, while increasing the batch size facilitates parallelization and accelerates training, this parameter requires careful calibration to ensure that the total memory demand remains within available GPU limits.
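The per-module figures are consistent with a simple back-of-the-envelope estimate: at 32-bit precision each parameter occupies 4 bytes, so the 14.401 M-parameter SSL-refined extractor maps to roughly 55 MB, in line with Table 5. A minimal sketch (the function name is ours):

```python
def fp32_param_memory_mb(params_millions):
    # 32-bit float = 4 bytes per parameter; 1 MB = 2**20 bytes.
    return params_millions * 1e6 * 4 / 2**20

# SSL-Refined P3D:  fp32_param_memory_mb(14.401) is about 54.9 MB
# Standard P3D:     fp32_param_memory_mb(66.492) is about 253.6 MB
```

The same arithmetic explains why shrinking the extractor from 66.49 M to 14.40 M parameters cuts its memory footprint by roughly a factor of 4.6.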

Conclusion

This study introduced a novel framework to address the critical challenges of fine-grained action quality assessment, specifically the systemic lack of explicit sub-score labels and the limited availability of datasets with temporal annotations. The framework successfully moves beyond conventional single-score regression models by integrating unsupervised temporal segmentation with a self-supervised learning strategy to achieve a robust and interpretable multi-stage evaluation. The core technical innovation lies in the symbiotic relationship between feature robustness and temporal modeling. An unsupervised segmentation technique was employed to automatically identify semantically meaningful sub-actions, circumventing the need for costly manual annotations. Following this, a novel self-supervised learning module was introduced to significantly enhance spatio-temporal feature representations, making them highly discriminative for subtle motion nuances and robust to temporal irregularities. These robust features were then leveraged within a progressive pseudo-subscore learning mechanism that iteratively models the sequential dependencies between sub-actions, explicitly capturing both short-range and cumulative long-range causal impacts. This iterative refinement process enables robust pseudo-subscore generation and leads to superior overall score prediction. Comparative analysis on the UNLV-Diving and FineDiving datasets confirms that the proposed multi-substage model achieves state-of-the-art performance on the SRC metric, demonstrating superior rank-ordering capability. A comprehensive ablation study confirmed the synergistic contribution of each module, validating that both the self-supervised feature refinement and the progressive pseudo-subscore learning are indispensable for the observed performance gains. 
Complementing the promising empirical performance, we provided a rigorous computational profile indicating that the framework maintains high inference efficiency and a manageable memory footprint, thereby confirming its scalability for practical, real-world deployment. Furthermore, the qualitative analysis confirmed that the framework generates pseudo-subscores that are strongly correlated with expert judging criteria, offering precise, actionable feedback for athletes.

Limitations

Despite its superior performance, the proposed framework exhibits several limitations, as detailed in the quantitative and qualitative analysis presented in the results section. First, the framework’s evaluation is currently constrained to the diving domain. Its effectiveness on actions with different temporal or spatial structures, such as gymnastics or continuous activities, remains an open question, and its direct applicability may be limited without further validation. Second, a limitation concerns the precision of the predicted sub-scores for the early substages. Observations indicate that the pseudo-scores for the initial sub-actions (e.g., the first three for the UNLV-Diving dataset and the first two for the FineDiving dataset) tend to be clustered within a small range. In contrast, the scores for later, more decisive stages, such as “Entry” and “End”, show a distinct and wider distribution. This suggests difficulty in capturing fine-grained discriminative features during the rapid postural changes of the early phases. Additionally, the performance of the framework is challenged by data imbalance, as professional-level datasets are dominated by high scores, leading to higher prediction errors for the rare, low-scoring samples where the model has seen fewer examples.

Future directions

Future work will focus on enhancing the robustness of the framework and expanding its applicability. A key priority is the collaborative development of new datasets with ground-truth sub-action-level score labels, enabling direct and precise numerical evaluation of the predicted sub-scores. Additionally, future efforts will aim to validate and extend the framework’s coverage to long-duration action sequences, such as synchronized diving and complex gymnastic routines, to further demonstrate the efficacy of the progressive long-range dependency modeling approach in more varied and complex scenarios.

Algorithms

The proposed framework is summarized in the following algorithms for clarity and comprehensiveness.

Algorithm 1. Pseudo-subscore calculation.

Algorithm 2. Incorporating the immediately preceding sub-score (Baseline + SSL + SA1).

Algorithm 3. Incorporating all preceding sub-scores (Baseline + SSL + SA2).
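Since the algorithm listings appear as figures in the published article, the progressive conditioning idea they describe can be sketched as follows. All names (`predict_subscore`, `progressive_scores`) and the stand-in regressor are illustrative assumptions, not the authors’ implementation; the actual model uses learned regressors over the self-supervised spatio-temporal features.

```python
# Illustrative sketch of progressive pseudo-subscore conditioning.
# The regressor below is a hypothetical stand-in: a feature summary
# shifted by the contextual sub-scores it is conditioned on.

def predict_subscore(feature, context):
    return sum(feature) / len(feature) + 0.1 * sum(context)

def progressive_scores(features, mode="SA2"):
    """Score each sub-action conditioned on earlier pseudo-subscores.

    mode="SA1": condition on the immediately preceding sub-score only
                (short-range causal effect).
    mode="SA2": condition on all preceding sub-scores
                (cumulative long-range influence).
    """
    subscores = []
    for feature in features:
        if mode == "SA1":
            context = subscores[-1:]   # last sub-score, if any
        else:
            context = subscores        # every earlier sub-score
        subscores.append(predict_subscore(feature, context))
    overall = sum(subscores)           # aggregate into the overall score
    return subscores, overall
```

With an empty context for the first sub-action, both variants reduce to the baseline pseudo-subscore calculation of Algorithm 1; they diverge from the third sub-action onward, where SA2 accumulates all earlier sub-scores rather than only the most recent one.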

Author contributions

M.M. designed and conducted the research, analyzed the data, and drafted the manuscript. E.F.E. provided overall guidance, supervision, and critical revisions to the manuscript. A.V. and A.H. contributed methodological expertise, provided additional supervision, and assisted with revising the manuscript. All authors read and approved the final version of the manuscript.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

