Scientific Reports. 2026 Feb 10;16:8100. doi: 10.1038/s41598-026-39330-9

Skeleton motion topology-masked prediction and contrastive learning for self-supervised human action recognition

Yan Hui 1, Fengyu Li 1, Xiuhua Hu 1, Xin Guo 1, Chao Shen 1

Abstract

To address the limitations in data augmentation and the neglect of joint dependencies in self-supervised human action recognition, this paper proposes a hybrid framework that integrates topology-masked motion modeling with contrastive learning. The proposed motion topology-masking technique jointly encodes skeletal topology and motion dynamics, preventing the model from over-focusing on temporally salient regions of prominent motions. We employ a multi-stage hybrid augmentation strategy, combining conventional and extreme augmentation methods to generate diverse, enriched positive pairs for contrastive learning. Additionally, we introduce a trajectory-guided feature dropping module, which selectively discards critical features based on trajectory attention maps, preventing excessive focus on local joint trajectories. This approach effectively leverages large-scale unlabeled skeleton data through self-supervised learning, significantly reducing reliance on costly annotated datasets. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate that the proposed model achieves superior performance both in occluded scenarios under complex environments and under low-supervision conditions. It effectively mitigates visual interference and annotation scarcity while substantially improving action recognition accuracy.

Keywords: Masked prediction, Contrastive learning, Motion awareness, Data augmentation

Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing

Introduction

As a pivotal research direction in the field of computer vision, human action recognition has seen significant performance improvements in recent years with the development of deep learning and sensing technologies, and has shown great potential in intelligent applications. The technology has been widely applied in medical rehabilitation1, intelligent interaction2, security surveillance3, and sports science4, demonstrating significant application value across these domains. While prior research predominantly focused on appearance-based action recognition, alternative approaches have demonstrated the advantages of using posture information. As a key input for human action recognition, skeletal sequences inherently encode rich multi-dimensional spatiotemporal information. Compared to RGB videos, 3D skeleton sequences offer inherent advantages: their low-dimensional topological representation mitigates privacy concerns regarding appearance disclosure, their compact data format enhances computational efficiency, and they are intrinsically robust against cluttered backgrounds, abrupt illumination changes, and clothing variations. These characteristics have established skeleton-based representation as a predominant research focus in action recognition, catalyzing substantial advancements in the field. However, current approaches remain constrained by their heavy reliance on annotated data, the acquisition of which is both labor-intensive and time-consuming. To address this limitation, this paper presents a novel investigation into self-supervised representation learning for 3D action recognition.

Current self-supervised human action recognition research primarily revolves around two paradigms. (1) Masked modeling-based approaches learn fine-grained feature representations by reconstructing randomly masked input data. Inspired by Masked Autoencoders (MAE)5, these methods randomly mask up to 80% of joints or frame intervals in skeleton sequences, compelling the model to recover the occluded content through spatiotemporal context reasoning. Such methods employ a pixel-wise reconstruction loss to drive encoders to capture fine-grained motion patterns, demonstrating outstanding performance in dense prediction tasks like action detection. (2) Contrastive learning-based approaches leverage augmentation techniques (e.g., skeleton cropping, motion jittering) to generate diverse positive sample pairs. By optimizing the InfoNCE loss to maximize feature consistency across different views of the same action, these methods learn discriminative sequence-level representations (e.g., global action semantics), demonstrating superior performance in skeleton-based motion representation.

Despite remarkable progress made by self-supervised learning in the field of human action recognition, existing approaches still confront several critical challenges. First, random masking strategies fail to effectively capture critical motion patterns and overlook joint dependency relationships, thereby limiting the model's capacity to learn deeper-level features6. Meanwhile, existing contrastive learning approaches predominantly rely on conventional augmentation strategies, which often induce model overfitting and exhibit limited generalization capability7. Consequently, enhancing model generalization and robustness without relying on extensive annotated data has emerged as a pivotal research challenge in the field.

To address the limitations of contrastive learning, namely that conventional augmentations are prone to overfitting, extreme augmentations cause information bias and critical feature dependency, and random masking strategies fail to capture essential motion patterns, this paper proposes a novel framework integrating contrastive learning with motion topology modeling. By using motion topology modeling to focus precisely on critical motion characteristics and contrastive learning to build multi-perspective positive sample pairs for feature distribution optimization, the framework compels the model to explore multi-dimensional features and learn more universal motion patterns.

The main contributions of this work can be summarized as follows:

  • This research proposes a methodological integration of contrastive learning with motion topology modeling, developing a novel masking mechanism that jointly incorporates skeletal topological structure and motion intensity metrics. The designed framework preserves static inter-joint constraints through topological graphs while applying kinematic-aware masking to motion-salient regions, using dynamic motion information as auxiliary supervision, with adaptive parameter adjustment to accommodate emerging action patterns.

  • Building upon conventional augmentation techniques, extreme augmentation strategies are employed to construct multi-perspective positive sample pairs, introducing richer motion patterns and more comprehensively exploiting latent information within the training data.

  • This paper proposes a trajectory-guided feature dropout module that specializes in analyzing and processing the kinematic trajectory characteristics of individual joints in skeletal data, thereby enabling the model to learn more generalized and transferable feature representations.

  • Comprehensive experiments and ablation studies conducted on three large-scale benchmark datasets systematically validate the effectiveness of the proposed skeleton-based action recognition approach.

Related work

Skeleton-based action recognition

Initial approaches for skeleton-based action recognition primarily utilized hand-crafted feature extraction techniques, exemplified by the STIP method8 which detected spatiotemporal interest points in 3D space. While providing baseline performance, these methods showed limited generalization capability.

For temporal sequence modeling, recurrent architectures have demonstrated superior performance due to their inherent sequence processing capabilities. Long Short-Term Memory networks9 and Gated Recurrent Units10 effectively capture temporal dynamics and long-range dependencies in skeleton sequences. Subsequent developments introduced attention mechanisms (VA-RNN11) and two-stream architectures12 to jointly model temporal dynamics and spatial configurations. The IndRNN architecture13 further improved training stability by addressing gradient vanishing/explosion issues. However, these RNN-based approaches remain fundamentally limited in spatial feature extraction.

For spatial modeling, Convolutional Neural Networks have achieved strong performance through various skeleton representations. Wang et al.’s Joint Trajectory Maps14 encode motion trajectories as three-channel images, while PoseC3D15 directly processes raw skeleton data using 3D convolutions. Caetano et al. developed alternative representations using geometric algebra16 and tree structures17. Although effective, these CNN-based methods typically require converting skeleton data into pseudo-image formats, potentially compromising the inherent spatiotemporal relationships in the original data.

In recent years, Graph Convolutional Networks (GCNs), with their inherent adaptability to graph-structured data, have gradually become an important tool for skeleton-based action recognition. Yan et al.’s ST-GCN18 pioneered joint modeling of spatial and temporal features through graph representations. Subsequent improvements include AS-GCN’s action-specific dependencies19 and InfoGCN’s information bottleneck optimization20. While these methods effectively exploit graph-structured data, their reliance on fixed topologies limits adaptability to diverse actions and long-range dependencies.

In parallel, the Transformer architecture has attracted widespread attention due to its powerful modeling capabilities, especially in handling temporal data. The ViViT model proposed by Arnab et al.21 employs spatiotemporal self-attention on video patches, while MTCN22 captures multimodal temporal contexts. Although these approaches excel at modeling long-range dependencies, their heavy reliance on annotated data poses challenges for skeleton-based tasks where labeling costs are prohibitive, especially for complex actions.

Despite significant advances in skeleton-based action recognition, current methods remain heavily dependent on costly manual annotations. To address this limitation, we propose a self-supervised framework integrating contrastive learning with masked skeleton modeling, significantly reducing annotation requirements while maintaining recognition performance.

Self-supervised masked modeling

Masked modeling has emerged as a powerful self-supervised paradigm, initially pioneered by BERT’s MLM approach in NLP23. In computer vision, ViT24 established the foundation for Transformer-based image processing, enabling subsequent masked approaches. MAE5 introduced an asymmetric encoder-decoder that reconstructs heavily masked images, while BEiT25 advanced this through discrete token prediction. MaskFeat26 further extended the framework by predicting feature descriptors from masked patches. These methods demonstrate masked modeling’s versatility across modalities while maintaining strong representation learning capabilities.

Masked modeling techniques have also been successfully applied to video data and 3D action recognition tasks. VideoMAE27, an extension of MAE5 to video data, learns video representations by randomly masking partial regions of video frames and reconstructing the missing content. This method fully leverages the temporal continuity of video data, significantly improving the effectiveness of video representation learning.

In the field of 3D action recognition, MAMP (Masked Motion Prediction)28 proposed a masked modeling method specifically for skeleton sequences. This method learns skeleton representations by masking partial joints in the skeleton sequence and predicting their motion information. By predicting the temporal motion of the masked joints, MAMP28 enhances the model's ability to model contextual motion information, significantly improving the performance of Transformers in 3D action recognition. Furthermore, SkeletonMAE29, a mask-based autoencoder for skeleton sequences, learns skeleton representations by masking partial joints in the skeleton sequence and reconstructing their original positions, further advancing the development of 3D action recognition technology.

Although the aforementioned methods have significantly improved the performance of visual representation learning by introducing masking mechanisms and Transformer architectures, they still face three limitations: high computational costs, strong dependence on large-scale data, and insufficient generalization ability. Additionally, previous works adopt topology-based masking strategies that lack flexibility, and the use of random masking fails to effectively capture key motion information, neglecting the inherent joint topological dependencies of human actions.

Current skeleton-based masking methods often rely on random masking or fixed topological strategies, failing to adequately consider the semantic importance and dynamic characteristics of joint movements, which limits the model’s ability to learn critical motion information. To construct more challenging and semantically relevant data views, introducing auxiliary information-guided masking strategies has become an important research direction. Such adaptive guidance mechanisms using multi-source information have been validated in various computer vision fields. For instance, in image dehazing, Su et al.30 fused multiple prior-based dehazing results through an image quality-guided adaptive weighting scheme; in RGB-thermal tracking, FMTrack31 utilized inter-modal frequency characteristics to construct adaptive filters for cross-modal interaction, while TDAT32 enhanced feature discriminability by modeling target-background dependencies with global agents. The common advantage of these methods lies in their content-aware guidance mechanisms, enabling models to transcend reliance on raw data or random strategies and focus more precisely on key information regions. Inspired by this, our motion topology masking strategy integrates human topological priors with dynamic motion information to establish a content-aware masking guidance mechanism. This approach directs the masking process toward information-rich joints and temporal segments, thereby facilitating the learning of more discriminative action representations during self-supervised pretraining.

Self-supervised contrastive learning

Contrastive learning, as an important branch of self-supervised learning, learns feature representations by constructing positive and negative sample pairs and has shown excellent performance on skeleton data. Its core lies in generating diverse positive sample pairs through data augmentation and optimizing contrastive loss to learn consistent representations.

Recent advances in contrastive learning have significantly advanced self-supervised skeleton representation learning. MoCo33 introduces a dynamic dictionary with momentum updates, while SimCLR34 demonstrates effective view-invariant learning without memory banks. For skeleton data, Chen et al.35 and Guo et al.7 developed specialized augmentation strategies, with the latter successfully adapting MoCo v2. Zhang et al.36 enhanced spatiotemporal modeling through topological mixing, while Gao et al.37 proposed motion trend-aware contrastive learning (ST-CL). CMD38 addresses false negatives through hierarchical invariance learning, and STARS39 replaces augmentation with feature-space neighbors. However, while contrastive learning excels at discriminative feature learning, it often overlooks intrinsic data semantics and structural relationships.

Existing methods learn joint features and their relationships from unlabeled skeleton data to generate discriminative motion representations. Studies demonstrate that integrating heterogeneous network architectures can effectively enhance feature learning capabilities. For instance, models such as T-Net40 and CLEAR41 proposed by Bui et al. combine Swin Transformer and vanilla ViT to construct dual-stream backbone networks, incorporating channel-aware self-attention and cross-fusion modules to jointly model global and local features. Such approaches validate that functionally complementary network branches offer novel insights for learning discriminative representations, also providing references for designing multi-view and multi-granularity feature learning in self-supervised frameworks.

In contrastive learning, data augmentation is key to constructing positive sample pairs. However, samples generated by conventional augmentation strategies often exhibit high similarity, which may lead to model overfitting; extreme augmentation can improve diversity but may compromise semantic consistency. Balancing diversity and semantic fidelity therefore becomes a central issue, a challenge also present in weakly-supervised learning. For example, in weakly-supervised image dehazing, Wang et al.42 introduced a discriminator integrating spatial and frequency information to more comprehensively evaluate the discrepancy between generated images and the real distribution, thereby enhancing the realism and clarity of results. This motivates adaptive optimization of positive sample construction and feature representation from multiple perspectives in contrastive learning.

Based on the above analysis, this paper proposes a multi-level augmented blending strategy. Building upon conventional augmentations, extreme augmentations are introduced to construct positive pairs for contrastive learning, thereby incorporating richer motion patterns.

Method

Framework overview

This work proposes a method that combines contrastive learning with motion-topology masked skeleton modeling, enabling simultaneous learning of both sequence-level semantic features and joint-level discriminative characteristics. This study employs a hybrid strategy that integrates conventional augmentation with extreme augmentation to capture the diversity and richness of complex actions, construct positive sample pairs from different perspectives, and uncover a broader spectrum of motion patterns. In addition, by introducing a trajectory-guided feature dropping module, this method conducts in-depth analysis of motion trajectories, amplifies subtle differences in movements, and identifies and removes important features, thereby alleviating the potential overfitting problem during extreme augmentation. In the masked modeling module, we introduce a motion-aware strategy to augment the topology-based masking approach. By leveraging motion information as auxiliary guidance, this method prioritizes masking motion-rich regions, thereby integrating both static joint topology and dynamic motion patterns into the masking process. The proposed method uses a three-layer Bi-GRU as its core encoder in a self-supervised setup. This architecture is highly parameter-efficient, reducing overfitting risks when annotated data is scarce. Furthermore, its gating mechanism effectively captures long-term motion dependencies, which is crucial for reconstructing masked joints and discerning action semantics. This robust modeling of bidirectional temporal dynamics provides a strong foundation for understanding motion. For the decoder, we employ a two-layer GRU. The overall framework is illustrated in Fig. 1.

Fig. 1. Overall framework diagram.

In Fig. 1, $z$, $z_n$, $z_e$, $z_{ed}$, and $h$ denote the embedding vector, the normal-view embedding vector, the extreme-augmented-view embedding vector, the dropped extreme-augmented-view embedding vector, and the feature vector, respectively.

The proposed framework consists of two main stages: (1) pre-training and (2) action recognition. During the pre-training stage, a self-supervised learning approach integrating masked skeleton modeling with contrastive learning is adopted. This allows the encoder to capture both motion patterns and semantic information inherent in skeleton sequences, ultimately yielding an encoder that can extract high-quality, robust, and discriminative action features. Upon completion of pre-training, the encoder parameters are fixed. In the action recognition stage, the pretrained encoder processes input skeleton sequences to generate feature representations, which are then classified into action categories through a classification head.

Masked skeleton modeling

This paper proposes a motion-topology masking (MTM) strategy, which enhances masked skeleton modeling by incorporating joint motion intensity priors. The method introduces motion-aware masking into the topological masking framework through a weighting mechanism, enabling the masking process to jointly consider both skeletal connectivity and dynamic motion patterns.

The specific process is shown in Fig. 2. First, joint groups are predefined according to human anatomical topology to produce grouped skeletons $S_g$ as candidate masking regions. Then, the motion intensity $I$ is computed to generate a motion-aware masking probability distribution $P_{motion}$. The motion-intensity and topological probabilities are blended to obtain a mixed probability distribution. According to this mixed distribution, $k$ joints are sampled and mapped to their affiliated body parts. All joints within the selected parts are masked (set to 0) during the current time segment, yielding the masked skeleton sequence.

Fig. 2. Pipeline of the masked skeleton modeling.

Topology-based joint partitioning

First, the input skeleton sequence $S$ is divided into a grouped sequence $S_g$ by partitioning joints according to human body topology, as shown in Fig. 3. The joints are categorized into the following groups: trunk, left arm, right arm, left leg, and right leg. Joints within each group exhibit close connectivity (e.g., the left-arm joints include the shoulder, elbow, and hand). We apply uniform masking with a shared probability $p$ per group, yielding the topological masking probability distribution $P_{topo}$. Meanwhile, the sequence is divided into segments along the time dimension, each of length $l$. A lookup-table sketch of this grouping is given below.
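For concreteness, the grouping can be expressed as a lookup table. The sketch below shows one plausible partition of the 25-joint NTU RGB+D skeleton into the five groups named above; the paper does not list the exact joint indices, so this assignment is illustrative.

```python
# One plausible topology-based partition of the 25-joint NTU RGB+D skeleton
# (0-based indices). The paper's exact joint-to-part assignment is not
# specified, so these index lists are illustrative assumptions.
BODY_PARTS = {
    "trunk":     [0, 1, 2, 3, 20],      # spine base/mid, neck, head, spine (shoulder center)
    "left_arm":  [4, 5, 6, 7, 21, 22],  # shoulder, elbow, wrist, hand, hand tip, thumb
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg":  [12, 13, 14, 15],      # hip, knee, ankle, foot
    "right_leg": [16, 17, 18, 19],
}

# Inverse map: sampled joints are looked up here, then the whole part is masked.
JOINT_TO_PART = {j: p for p, joints in BODY_PARTS.items() for j in joints}
```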

Fig. 3. Topology-based partitioning.

Motion-aware masking probability computation

The proposed method calculates motion-aware masking probabilities using a motion-sensitive strategy. In contrast to the joint-level motion masking used in MAMP28, our method operates at the body-part level of granularity. Specifically, the grouped skeleton sequence $S_g$ is first divided into non-overlapping temporal segments. Based on this, the initial motion intensity $I$ is derived by calculating the absolute differences in joint positions between adjacent frames. The multi-channel intensities are then averaged to derive per-joint motion importance in each spatiotemporal segment:

$$I_{t,v} = \frac{1}{C}\sum_{c=1}^{C}\left|S_{t+i,v,c} - S_{t,v,c}\right| \tag{1}$$

where i is the temporal offset. The motion intensities are normalized into probability distributions using a temperature parameter:

$$P_{motion,v} = \frac{\exp\left(I_v/\tau_m\right)}{\sum_{v'}\exp\left(I_{v'}/\tau_m\right)} \tag{2}$$

The topology-based masking strategy incorporates motion intensity through a weighted fusion of the topological masking probability distribution and the motion-intensity probability distribution, yielding a comprehensive masking probability distribution. Each joint's masking probability thus depends on both its topological position and its motion intensity:

$$P_{mix} = \alpha\, P_{topo} + (1-\alpha)\, P_{motion} \tag{3}$$

where $\alpha$ is the weighting parameter balancing topological and motion information. The motion-topology mixing mechanism simultaneously highlights high-motion regions and prevents overfitting to key joints. A sketch of this computation follows.
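The following minimal PyTorch sketch traces Eqs. (1)-(3); for brevity it computes a single intensity value per joint over the whole sequence rather than per temporal segment, and the function name, offset argument, and temperature default are our assumptions.

```python
import torch
import torch.nn.functional as F

def motion_topology_probs(seq, p_topo, alpha=0.6, tau_m=0.1, offset=1):
    """Mixed masking probabilities (Eqs. 1-3), sketched per joint.
    seq: (T, V, C) joint coordinates; p_topo: (V,) topological probabilities."""
    # Eq. (1): channel-averaged absolute frame differences (motion intensity)
    diff = (seq[offset:] - seq[:-offset]).abs()       # (T-offset, V, C)
    intensity = diff.mean(dim=(0, 2))                 # (V,) averaged over time and channels
    # Eq. (2): temperature-scaled softmax over the joint dimension
    p_motion = F.softmax(intensity / tau_m, dim=0)    # (V,)
    # Eq. (3): weighted fusion of topological and motion distributions
    return alpha * p_topo + (1.0 - alpha) * p_motion  # (V,) mixed probabilities
```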

The original skeleton input is masked following the aforementioned strategy. The masked skeleton is then processed by the encoder $f_e$ to extract feature representations, while the decoder $f_d$ reconstructs the skeleton to produce the predicted skeleton $\hat{S}$. The mean squared error (MSE) between the original $S$ and the predicted $\hat{S}$ is optimized specifically over the masked regions:

$$\mathcal{L}_{mask} = \frac{1}{N}\sum_{t,v}\left(1 - M_{t,v}\right)\left\|S_{t,v} - \hat{S}_{t,v}\right\|_2^2 \tag{4}$$

where $N$ is the total number of masked joints and $M$ is the binary mask (1 for visible joints, 0 for masked joints). When $M_{t,v}=0$, the joint requires prediction; when $M_{t,v}=1$, the joint is known. A sketch of this loss follows.
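A compact implementation of Eq. (4) under the mask convention above (1 = visible, 0 = masked) might look as follows; the tensor shapes and normalization by the masked-entry count are assumptions.

```python
def masked_mse(pred, target, mask):
    """Eq. (4): MSE over masked joints only.
    pred, target, mask: float tensors of identical shape; mask is 1 for
    visible joints and 0 for masked joints, so (1 - mask) selects the
    positions the decoder must predict."""
    masked = 1.0 - mask
    n = masked.sum().clamp(min=1.0)                  # number of masked entries
    return ((pred - target) ** 2 * masked).sum() / n
```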

Multi-level augmentation hybrid strategy

Conventional approaches, typically reliant on a single augmentation strategy, offer limited positive sample diversity. This work introduces a Multi-level Augmentation Hybrid (MAH) strategy to systematically combine techniques of different granularities and intensities. The framework incorporates both semantics-preserving conventional augmentations and extreme augmentations aimed at discovering new motion patterns. Cooperation with the TGFD module (see the trajectory-guided feature dropping module section) alleviates potential noise from extreme transformations, thereby improving feature robustness and discriminability. The MAH strategy, which extends local- and mixed-skeleton augmentations with extreme variants, facilitates a more comprehensive exploration of motion patterns.

Local-skeleton augmentation

Local-skeleton augmentation primarily utilizes Temporal crop-resize43 and Shear44 on individual skeleton sequences for transformation learning, aiming to generate diverse samples. This helps the model learn feature representations with stronger generalization ability. By altering the spatiotemporal structure of skeleton sequences, the model is forced to learn more robust features, enabling it to better capture spatiotemporal relationships and improve its recognition capability.

Specifically, for an input skeleton sequence $S$, positive sample pairs $(S_q, S_k)$ are generated via temporal crop-resize, shear, and joint jittering transformations. Each pair is encoded by the query encoder $f_q$ and the key encoder $f_k$, then projected through $g_q$ and $g_k$ to yield features $z_q$ and $z_k$. A memory queue stores negatives for contrastive learning, with the network optimized via the InfoNCE loss45:

$$\mathcal{L}_{InfoNCE} = -\log\frac{\exp\left(\mathrm{sim}(z_q, z_k)/\tau\right)}{\exp\left(\mathrm{sim}(z_q, z_k)/\tau\right) + \sum_{i=1}^{K}\exp\left(\mathrm{sim}(z_q, m_i)/\tau\right)} \tag{5}$$

where $m_i$ represents the $i$-th negative sample feature in the memory queue, $\tau$ is the temperature hyperparameter, and $\mathrm{sim}(\cdot,\cdot)$ computes the similarity between the L2-normalized $z_q$ and $z_k$. After each training step, batch samples update the memory queue following a FIFO policy. A sketch of this objective is given below.
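Below is a minimal sketch of Eq. (5) in the MoCo style described above, with the positive logit placed at index 0 so that a standard cross-entropy reduces to the InfoNCE objective; the variable names and queue layout are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_q, z_k, queue, tau=0.07):
    """Eq. (5): InfoNCE with a FIFO memory queue of negatives.
    z_q, z_k: (N, D) projected features of two views; queue: (K, D) negatives."""
    z_q = F.normalize(z_q, dim=1)
    z_k = F.normalize(z_k, dim=1)
    l_pos = (z_q * z_k).sum(dim=1, keepdim=True)       # (N, 1) positive similarity
    l_neg = z_q @ F.normalize(queue, dim=1).t()        # (N, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / tau    # positive sits at column 0
    labels = torch.zeros(len(z_q), dtype=torch.long, device=z_q.device)
    return F.cross_entropy(logits, labels)             # -log softmax of the positive
```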

Mixed-skeleton augmentation

Mixed-skeleton augmentation employs CutMix46, ResizeMix47, and Mixup48 to blend pairs of skeleton sequences, generating samples with diverse patterns and actions, thereby enhancing data diversity. Moreover, it enables the model to better understand the relationships between different actions and improves the model’s recognition capability.

Specifically, for two given skeleton sequences $S_1$ and $S_2$, one of the three augmentation methods is selected with a randomly determined operation probability to obtain mixed skeleton data $S_{mix}$. The corresponding embedded features are then extracted via $f_q$ and optimized with the InfoNCE loss of Eq. (5).

Extreme augmentation

Conventional augmentation generates similar positive samples that constrain the exploration of motion diversity. Therefore, the extreme augmentation strategy proposed by AimCLR7 is further introduced to explore richer movement patterns. This approach combines spatial (translation/shearing/flipping/rotation), temporal (axis masking), and spatiotemporal (Gaussian noise/blur) transformations to produce significantly altered views. To mitigate identity distortion in the original sequences and the training instability that demands extended schedules and sophisticated strategies, we retain the beneficial extreme augmentations while selectively excluding the most aggressive transformations.

Specifically, to prevent information loss from axial masking (entire-axis occlusion hinders effective representation learning) and motion detail erosion from excessive Gaussian blur, the extreme augmentation in this paper uses three spatial augmentations (Shear44, Spatial Flip, and Rotate), two temporal augmentations (Crop and Temporal Flip), and one spatiotemporal augmentation (Gaussian Noise), which together produce the final extremely-augmented sequence. The embedded feature is then obtained through $f_q$, and $z_e$ is optimized using the InfoNCE loss, as shown in Eq. (5). A compositional sketch is given below.
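A sketch of how such a composition might be implemented is shown below; the transform parameters, application probabilities, and ordering are illustrative rather than the paper's exact pipeline.

```python
import torch

def shear(seq, s=0.3):
    """Random spatial shear of (T, V, 3) joint coordinates: identity matrix
    plus random off-diagonal entries in [-s, s]."""
    A = torch.eye(3) + (torch.rand(3, 3) - 0.5) * 2 * s * (1 - torch.eye(3))
    return seq @ A.t()

def temporal_flip(seq):
    return seq.flip(dims=[0])                 # reverse the frame order

def gaussian_noise(seq, sigma=0.01):
    return seq + sigma * torch.randn_like(seq)

def extreme_augment(seq):
    """One possible composition of the retained transforms; axial masking
    and Gaussian blur are excluded, as described above. Probabilities and
    the subset applied here are illustrative."""
    out = shear(seq)
    if torch.rand(1) < 0.5:
        out = temporal_flip(out)
    return gaussian_noise(out)
```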

To counter the identity information loss caused by extreme augmentations, a dual distribution divergence minimization loss6 is employed between the normal- and extreme-augmented views. This loss preserves motion semantics while enhancing feature discriminability, using Kullback-Leibler (KL) divergence to measure distributional differences via two components: $\mathcal{L}_{d1}$ for normal versus extreme views, and $\mathcal{L}_{d2}$ for normal versus drop-based views:

$$p_i(z) = \frac{\exp\left(\mathrm{sim}(z, m_i)/\tau\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{sim}(z, m_j)/\tau\right)} \tag{6}$$
$$\mathcal{L}_{d1} = D_{KL}\left(p(z_n)\,\|\,p(z_e)\right) \tag{7}$$
$$\mathcal{L}_{d2} = D_{KL}\left(p(z_n)\,\|\,p(z_{ed})\right) \tag{8}$$

the final dual distribution divergence loss:

$$\mathcal{L}_{dd} = \mathcal{L}_{d1} + \mathcal{L}_{d2} \tag{9}$$

This approach effectively leverages the novel motion patterns introduced by extreme augmentation while softly minimizing distribution discrepancies, thereby preventing the performance degradation caused by overly aggressive transformations. A sketch of the two KL components follows.
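Assuming $p(\cdot)$ is the softmax similarity distribution of Eq. (6), the two KL components and their sum can be sketched as follows (PyTorch's kl_div expects its first argument in log space):

```python
import torch.nn.functional as F

def dual_divergence(p_n, p_e, p_ed):
    """Eqs. (7)-(9): KL divergences between the normal view's similarity
    distribution p_n and those of the extreme (p_e) and dropped-extreme
    (p_ed) views. All inputs are probability vectors over the queue,
    shape (N, K); kl_div(log q, p) computes KL(p || q)."""
    l_d1 = F.kl_div(p_e.log(), p_n, reduction="batchmean")   # Eq. (7)
    l_d2 = F.kl_div(p_ed.log(), p_n, reduction="batchmean")  # Eq. (8)
    return l_d1 + l_d2                                        # Eq. (9)
```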

Trajectory-guided feature dropping module

The proposed trajectory-guided feature dropping (TGFD) module actively eliminates salient features to compel the model to learn more comprehensive representations and mitigate over-dependence on specific local features. By implementing a trajectory-aware attention mechanism, the module computes attention maps through temporal dimension compression, deriving global trajectory representations that focus on channel-wise motion patterns in skeletal data. This approach establishes long-term spatial dependencies between joints while amplifying subtle motion distinctions, thereby enhancing the model’s ability to discriminate semantically similar actions. The detailed pipeline is presented in Fig. 4.

Fig. 4. Trajectory-guided feature dropping module.

The Trajectory-guided Feature Dropping (TGFD) module identifies and removes over-relied features by analyzing global joint trajectories. It uses temporal average pooling to create global descriptors, highlighting long-term spatial patterns. These descriptors are then processed by two separate MLPs to produce trajectory-based attention maps that purely reflect spatial importance derived from global motion. The input feature $F$ undergoes temporal average pooling to extract each joint's global representation $F_g$ across all time steps. Two separate MLPs (each with two fully connected layers and ReLU activation) then non-linearly transform $F_g$ into Query ($Q$) and Key ($K$) representations. Channel-wise element-wise multiplication between Query and Key yields initial attention scores, which are normalized via softmax over the joint dimension to produce the final trajectory-aware attention map:

$$A = \mathrm{softmax}_{V}\left(Q \odot K\right) \tag{10}$$

The trajectory-based attention map $A$ and the keep parameter jointly generate the spatial attention mask $M_s$ and the temporal attention mask $M_t$, where keep controls feature retention by preserving low-attention-weight features while discarding high-attention-weight ones.

The spatial mask $M_s$ is applied to the input feature $F$, zeroing out important spatial features, followed by normalization to yield the feature $F'$. The temporal mask $M_t$ then processes $F'$ to produce the final output $F''$ of the trajectory-guided feature dropping module.

The dropped feature $F''$ is projected through the embedding projectors $g_q$ and $g_k$ to obtain its corresponding representation $z_{ed}$, which is likewise optimized with the InfoNCE loss, as formulated in Eq. (5). Unlike traditional regularization methods such as random dropping or low-attention feature selection, the proposed module actively suppresses high-attention features. This design breaks the model's over-reliance on dominant local patterns, forcing the encoder to explore neglected complementary cues. A sketch of the spatial branch follows.
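The sketch below illustrates the spatial branch of the module; the temporal mask and post-drop normalization are omitted for brevity, the input layout (N, T, V, C) and layer sizes are illustrative, and the channel reduction after the Q-K product is an assumption.

```python
import torch
import torch.nn as nn

class TGFD(nn.Module):
    """Trajectory-guided feature dropping, spatial branch only (a sketch):
    temporal average pooling yields per-joint global descriptors, two MLPs
    form Query/Key, and the highest-attention joints are zeroed out."""
    def __init__(self, dim, keep=0.7):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.keep = keep                          # fraction of joints retained

    def forward(self, x):                         # x: (N, T, V, C)
        g = x.mean(dim=1)                         # (N, V, C) global joint trajectories
        attn = (self.q(g) * self.k(g)).sum(-1)    # channel-wise product, reduced per joint
        attn = attn.softmax(dim=-1)               # Eq. (10): normalize over joints
        n_drop = int(x.size(2) * (1 - self.keep)) # joints to discard
        drop_idx = attn.topk(n_drop, dim=-1).indices
        mask = torch.ones_like(attn).scatter(-1, drop_idx, 0.0)
        return x * mask[:, None, :, None]         # zero out high-attention joints
```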

Obtaining classification results

The proposed framework trains the encoder via a synergistic mechanism combining contrastive learning with masked prediction to enhance feature representation quality. The masked skeleton prediction branch generates positive samples for contrastive learning, while gradient information, derived from discriminating action sequence features during contrastive learning, is backpropagated to update the reconstruction decoder $f_d$ in the prediction branch.

This framework jointly optimizes contrastive learning and masked prediction during pre-training to produce universal feature representations with superior action recognition capability. The co-optimization compels the encoder to learn discriminative motion patterns and semantic information from skeleton sequences. Through iterative optimization, the total loss $\mathcal{L}_{total}$ progressively converges until the encoder parameters stabilize at their feature-extraction-optimal state, after which the encoder is frozen.

The total loss function can be expressed as:

$$\mathcal{L}_{total} = \lambda_1\,\mathcal{L}_{cl} + \lambda_2\,\mathcal{L}_{dd} + \lambda_3\,\mathcal{L}_{mask} \tag{11}$$

The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ act as weighting coefficients balancing the contrastive learning loss $\mathcal{L}_{cl}$, the dual distribution divergence loss $\mathcal{L}_{dd}$, and the mask reconstruction loss $\mathcal{L}_{mask}$. This weighted combination preserves contrastive learning's semantic discriminability while enforcing stronger joint-level reconstruction constraints, as sketched below.
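Eq. (11) amounts to a weighted sum; the sketch below uses the optimal weights reported later in the hyperparameter analysis (Table 9) as defaults.

```python
def total_loss(l_cl, l_dd, l_mask, lam1=1.0, lam2=0.3, lam3=30.0):
    """Eq. (11): weighted combination of the contrastive, dual-divergence,
    and mask reconstruction losses; default weights follow Table 9."""
    return lam1 * l_cl + lam2 * l_dd + lam3 * l_mask
```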

In the action recognition phase, the input skeleton sequence $X$ is processed through the bidirectional GRU encoder, producing a hidden state tensor $h$ of shape $(N, 2H)$, where $H$ denotes the GRU hidden layer size (doubled by bidirectionality) and $N$ denotes the batch size. This encoder output is then fed into a fully-connected layer for classification, generating the final output $y$:

$$y = Wh + b \tag{12}$$

Here, $W$ is the weight matrix of the fully connected layer and $b$ is the bias term, after which the output $y$ is normalized through softmax to produce the probability distribution $\hat{y}$:

$$\hat{y}_i = \frac{\exp(y_i)}{\sum_{j=1}^{C}\exp(y_j)} \tag{13}$$

The value $\hat{y}_i$ denotes the classification score for the $i$-th category among $C$ total categories. The model's predicted class $\hat{c}$ for each sample is determined by selecting the category with maximum probability:

$$\hat{c} = \arg\max_{i}\,\hat{y}_i \tag{14}$$

During this phase, the cross-entropy loss quantifies the difference between the predicted outputs $\hat{y}$ and the ground-truth labels $y$, driving parameter updates to optimize model training:

$$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log \hat{y}_{ij} \tag{15}$$

where $N$ denotes the dataset size and $C$ the category count; $\hat{y}_{ij}$ represents the predicted probability of sample $i$ belonging to class $j$, while the ground-truth label $y_{ij}$ indicates actual class membership (1 for the true class, 0 otherwise). A sketch of this classification stage follows.
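The classification stage reduces to a single linear layer plus softmax; in the sketch below the hidden size and class count are assumed values for illustration.

```python
import torch
import torch.nn as nn

H = 256                                   # assumed Bi-GRU hidden size (not stated in the paper)
C = 60                                    # number of classes, e.g. NTU-60
head = nn.Linear(2 * H, C)                # Eq. (12): y = Wh + b; 2H from bidirectionality

def predict(h):
    """h: (N, 2H) features from the frozen encoder."""
    y = head(h)                           # Eq. (12): class logits
    y_hat = y.softmax(dim=-1)             # Eq. (13): class probability distribution
    return y_hat.argmax(dim=-1)           # Eq. (14): most probable class per sample

criterion = nn.CrossEntropyLoss()         # Eq. (15); applied to the logits y during training
```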

The overall flow of the algorithm is shown in Algorithm 1:

Algorithm 1. Procedure of the proposed method.

Experiment

Dataset

NTU RGB+D 6049: This dataset comprises 60 action categories, covering three major scenarios: daily behaviors, medical condition-related actions, and two-person interactions, with a total of 56,578 samples. The standard evaluation protocols include cross-subject (xsub) and cross-view (xview).

NTU RGB+D 12050: This dataset includes 120 categories of actions, covering four major scenarios: daily behaviors, medical diagnosis actions, two-person interactions, and device interactions, with a total of 114,480 samples. Its standard evaluation protocols consist of two types: cross-subject (xsub) and cross-setup (xset).

PKU-MMD51: The dataset is divided into two phases: Phase 1 comprises 1,076 long video sequences covering 51 action categories, while Phase 2 contains 1,009 short video sequences encompassing 41 action categories.

Experimental setup

Data Preprocessing: During data preprocessing, all skeleton sequences were uniformly aligned to 300 frames through zero-padding or truncation. This fixed length was empirically determined to preserve the essential information of most sequences in the dataset. Spatial-temporal joint normalization was applied to eliminate unit differences, with the data ultimately restructured into a normalized tensor format (3D coordinates × 300 frames × 25 joints × 2 subjects) for direct model input, as sketched below.
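A minimal sketch of the frame alignment step is given below; the tensor layout and function name are assumptions consistent with the description above.

```python
import torch

def pad_or_truncate(seq, target_len=300):
    """Align a (T, V, C, M) skeleton sequence to exactly 300 frames by
    zero-padding or truncation, as described above."""
    t = seq.shape[0]
    if t >= target_len:
        return seq[:target_len]
    pad = torch.zeros(target_len - t, *seq.shape[1:], dtype=seq.dtype)
    return torch.cat([seq, pad], dim=0)

# Final layout: 3 coordinates x 300 frames x 25 joints x 2 subjects, e.g.:
# x = pad_or_truncate(raw).permute(2, 0, 1, 3)  # (C, T, V, M)
```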

Network Architecture: The model adopts a three-layer bidirectional GRU (Bi-GRU) as the encoder backbone, with hidden dimension $H$ and the number of encoder layers set to 3. The decoder uses a two-layer GRU, which reconstructs the input data. An MLP serves as the projection head, mapping features into a 128-dimensional embedding space.

Pre-training: During self-supervised pre-training, we employ the SGD optimizer with a weight decay of 0.0001 and initialize the EMA parameter to 0.999. The model is trained for 429 epochs with an initial learning rate of 0.02, decayed to 0.002 at epoch 350. All experiments are implemented in PyTorch on NVIDIA RTX 3080 GPUs with a total batch size of 128.

Computational Efficiency: To evaluate the practical applicability of the proposed method, we further analyze its computational cost. The model has approximately 19.9 million parameters and a computational complexity of 2.7 GFLOPs. Pre-training on two NVIDIA RTX 3080 GPUs with a batch size of 128 takes about 62 hours, with peak memory usage per GPU reaching 14.2 GB. During inference, memory usage drops to 2.1 GB.

Reporting and Significance Analysis: The experimental results presented in this paper represent the mean accuracy obtained from five independent runs with different random seeds.

Evaluation and comparison

To validate the model effectiveness, experiments are performed on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD benchmark datasets with evaluation conducted under three protocols: linear evaluation, fine-tuning, and transfer learning.

Linear Evaluation Protocol: Under the linear evaluation protocol, the pre-trained encoder parameters remain frozen for feature extraction, and a linear classifier is trained for 100 epochs using an SGD optimizer (momentum = 0.9, batch size = 256) with an initial learning rate of 0.1. Evaluations on the NTU-60 and NTU-120 datasets demonstrate the proposed method's superior performance compared to state-of-the-art approaches (Table 1). Compared with contrastive learning methods, it shows a clear advantage by combining contrastive learning with masked prediction. In particular, on NTU-60 xview the method achieves the highest accuracy among all stream inputs, further confirming the benefit of this combination. By using only joint data as input, it avoids the noise introduced by cross-modal fusion and learns effective feature representations at lower computational cost.

Table 1.

Performance comparison results under the linear evaluation protocol. Single-stream: joints; three-stream: joints + bones + motion. The best and second-best accuracies are highlighted in bold and underlined, respectively.

Method | Input | NTU-60 xsub (%) | NTU-60 xview (%) | NTU-120 xsub (%) | NTU-120 xset (%)
LongTGAN52 | Single-stream | 39.1 | 48.1 | 35.6 | 39.7
H-Transformer53 | Single-stream | 69.3 | 72.8 | – | –
MS2L54 | Single-stream | 52.6 | – | – | –
ISC55 | Single-stream | 76.3 | 85.2 | 67.1 | 67.9
SkeletonMAE29 | Single-stream | 74.8 | 77.7 | 72.5 | 73.5
CRRL6 | Single-stream | 67.7 | 73.8 | 56.2 | 57.0
GL-Transformer45 | Single-stream | 76.3 | 83.8 | 66.0 | 68.7
HaLP56 | Single-stream | 79.7 | 86.8 | 71.1 | 72.2
CPM57 | Single-stream | 78.7 | 84.9 | 68.7 | 69.6
P&C58 | Single-stream | 50.7 | 75.3 | 42.7 | 41.7
CMD38 | Single-stream | 79.4 | 86.9 | 70.3 | 71.5
SkeletonCLR59 | Three-stream | 75.0 | 79.8 | 60.7 | 62.6
AimCLR7 | Three-stream | 78.9 | 83.8 | 68.2 | 68.8
PSTL60 | Three-stream | 79.1 | 82.6 | 69.2 | 70.3
SkeAttnCLR61 | Three-stream | 82.0 | 86.5 | 77.1 | 80.0
HYSP62 | Three-stream | 79.1 | 85.2 | 64.5 | 67.3
RVTCLR+63 | Three-stream | 79.7 | 84.6 | 68.0 | 68.9
ViA64 | Two-stream | 78.1 | 85.8 | 69.2 | 66.9
Skeleton-logoCLR65 | Single-stream | 82.4 | 87.2 | 72.8 | 73.5
Ours | Single-stream | 80.9 | 87.8 | 73.5 | 74.7

Fine-tuned Evaluation Protocol: Under the fine-tuning protocol, we attach an MLP head to the pre-trained backbone and fine-tune the entire network for 100 epochs (batch size = 128) with an initial learning rate of 0.1 following cosine decay. Table 2 reports our evaluation on the NTU-60 and NTU-120 datasets. Compared with self-supervised approaches (HYSP62, SSL66, ViA64), our method achieves superior performance on the NTU-60 xview and NTU-120 xset benchmarks and second-best accuracy on NTU-60 xsub and NTU-120 xsub. These results validate that our mask prediction-enhanced contrastive learning framework improves supervised generalization, and that the pre-trained features exhibit strong discriminative power and transferability.

Table 2.

Performance comparison under the fine-tuning protocol. Single-stream: joints; three-stream: joints + bones + motion. The best and second-best accuracies are highlighted in bold and underlined, respectively.

Method | Input | NTU-60 xsub (%) | NTU-60 xview (%) | NTU-120 xsub (%) | NTU-120 xset (%)
SkeletonMAE29 | Single-stream | 86.6 | 92.9 | 76.8 | 79.1
CrosSCLR51 | Three-stream | 84.6 | 90.5 | 75.0 | 77.9
CPM57 | Single-stream | 84.8 | 91.1 | 87.4 | 78.9
AimCLR7 | Three-stream | 83.9 | 90.4 | 74.6 | 77.2
HYSP62 | Three-stream | 89.1 | 95.2 | 84.5 | 86.3
SSL66 | Single-stream | 92.8 | 96.5 | 84.8 | 85.7
ViA64 | Two-stream | 89.6 | 96.4 | 85.0 | 86.5
FreqMixFormer67 | Single-stream | 91.5 | 96.0 | 87.9 | 89.1
BlockGCN68 | Single-stream | 90.9 | 95.4 | 86.9 | 88.2
Ours | Single-stream | 90.7 | 96.6 | 85.8 | 86.5

Transfer Learning Protocol: This protocol assesses representation transferability through a two-stage framework: self-supervised pre-training on source datasets (NTU-60/NTU-120) followed by supervised fine-tuning on the target dataset (PKU-II). As shown in Table 3, the model demonstrates reasonable but limited transfer performance. Compared to methods such as MAMP, our approach still has room for improvement in transfer performance on PKU-MMD II. This can be primarily attributed to the significant domain gap between the NTU and PKU-MMD datasets. The NTU dataset contains a large number of two-person interaction actions (e.g., handshaking, hugging), whereas PKU-MMD (particularly in its second partition) focuses more on individual daily actions (e.g., drinking water, typing on a keyboard). This discrepancy in action category distribution and interaction complexity may leave models pre-trained on NTU with insufficient discriminative capability for certain fine-grained actions in PKU-MMD. Furthermore, the masking strategy applied to the source data simulates continuous occlusion scenarios, while occlusions in the target dataset are more random and irregular. This inconsistency in occlusion patterns also contributes to the suboptimal transfer performance.

Table 3.

Performance comparison under the transfer learning protocol. Source datasets: NTU-60 and NTU-120. Best and second-best accuracies are highlighted in bold and underlined, respectively.

Method | To PKU-II, pre-trained on NTU-60 (%) | To PKU-II, pre-trained on NTU-120 (%)
LongTGAN52 | 44.8 | –
MS2L53 | 45.8 | –
ISC55 | 51.1 | 52.3
CMD38 | 56.0 | 57.0
SkeletonMAE29 | 58.4 | 61.0
MAMP28 | 70.6 | 73.2
Ours | 59.5 | 60.1

Ablation study

This section provides an in-depth analysis of the proposed method. The results are obtained under the linear evaluation protocol.

Module Effectiveness Analysis: To validate the contribution of each proposed module, we conduct comprehensive experiments on the NTU-60 xview dataset (Table 4). The model achieves competitive action recognition performance when using motion-topology masking (MTM) with conventional augmentation (CA). Pairing random masking (RM) with the multi-level augmentation hybrid (MAH) yields a significant improvement, while combining motion-topology masking with the hybrid augmentation further refines performance. The optimal results are attained by integrating the trajectory-guided feature dropping (TGFD) module, demonstrating that our three strategies collectively enable the encoder to learn more robust and task-adaptive features.

Table 4.

Ablation study on the proposed modules. ✓ indicates module usage; CA denotes conventional augmentation. Best results in bold.

MTM | MAH | TGFD | RM | CA | NTU-60 xview (%)
✓ | | | | ✓ | 84.8
| ✓ | | ✓ | | 85.9
✓ | ✓ | | | | 86.1
✓ | ✓ | ✓ | | | 87.8

Masking Strategy Analysis:The proposed Motion Topology Masking (MTM) strategy combines motion-aware and topological masking to create more challenging data views. Comparative results on NTU-60 xview and NTU-120 xset datasets (Table 5) confirm its efficacy: random masking (RM) is the least effective, while introducing either topology-based (TRM) or motion-aware (MAM) masking individually boosts performance to similar levels, implying complementary supervisory signals. The complete MTM strategy achieves state-of-the-art results, proving that motion cues act as a powerful prior for semantic-rich masking. This synergy with topological masking avoids over-dependence on motion alone and delivers the highest performance.

Table 5.

Ablation study on masking strategies. Best results in bold.

Method | NTU-60 xview (%) | NTU-120 xset (%)
RM | 86.6 | 73.2
TRM | 87.2 | 73.9
MAM | 87.1 | 73.5
MTM | 87.8 | 74.7

Masking Ratio Analysis: This paper conducts an ablation study on the masking ratio using the NTU-60 xview dataset. As a widely adopted benchmark in skeleton-based action recognition, this dataset yields evaluation results that are recognized as generalizable and comparable. As shown in Fig. 5, experiments with different masking ratios reveal that both excessively high and excessively low ratios degrade performance on NTU-60 xview. Empirically, a 40% masking ratio achieves optimal performance, balancing sufficient data augmentation with information preservation.

Fig. 5. Ablation study on masking ratios.

Ablation Study on the Masking Weight $\alpha$: To validate the balancing mechanism between topological priors and motion information in our motion topology masking strategy, we conducted an ablation study on the mixture weight $\alpha$ under the NTU-60 xview protocol. As shown in Table 6, as $\alpha$ increases from 0 (pure motion masking) to 1 (pure topology masking), model performance first rises and then declines, peaking at 87.8% when $\alpha = 0.6$. This optimal configuration indicates that the topological structure provides a slightly dominant prior (weight 0.6), while motion information serves as an essential dynamic semantic supplement (weight 0.4). The synergy between the two enables the generation of masked views that are both semantically challenging and physiologically plausible, confirming the advantage of the linear combination in Eq. (3).

Table 6.

Ablation analysis of the mixture weight $\alpha$ on NTU-60 xview. Best results in bold.

$\alpha$ | Topological Weight | Motion Weight | NTU-60 xview (%)
0.0 | 0.0 | 1.0 | 86.6
0.2 | 0.2 | 0.8 | 87.0
0.4 | 0.4 | 0.6 | 87.5
0.6 | 0.6 | 0.4 | 87.8
0.8 | 0.8 | 0.2 | 87.3
1.0 | 1.0 | 0.0 | 86.9

Extreme Augmentation Combination: To systematically evaluate the contribution of each component in our extreme augmentation strategy, we conducted component-level ablation studies on the NTU-60 xview dataset. As shown in Table 7, based on the complete set of six transformations (including Shear, Spatial Flip, Rotate, Crop, Temporal Flip, and Gaussian Noise), the separate introduction of the excluded Axial Masking and Gaussian Blur resulted in significant accuracy drops of 2.7% and 1.8%, respectively. This indicates that Axial Masking compromises representation learning by disrupting the spatial topology of skeletons, while Gaussian Blur impairs motion discriminability by excessively smoothing temporal features. Our experiments demonstrate that the adopted six transformations collectively achieve an optimal balance between enhancing data diversity and preserving essential action semantics.

Table 7.

Ablation Study for the Extreme Augmentation Combination.

Augmentation Strategy | NTU-60 xview (%)
Complete set | 87.8
Incorporating Axial Masking | 85.1
Incorporating Gaussian Blur | 86.0

Enhanced Performance Analysis: As evidenced in Table 8, comprehensive experiments on NTU-60 xview reveal distinct performance patterns across augmentation strategies: conventional augmentation (CA) alone achieves 84.8% accuracy, while exclusive extreme augmentation (EA) degrades performance due to excessive feature distortion. The hybrid approach combining conventional and extreme augmentations outperforms the conventional-only setting, confirming their complementary effects. Standard Dropout (SD) improves accuracy to 86.3%, ordinary attention (OA) further increases it to 86.9%, and trajectory-based attention (TBA) achieves optimal performance through enhanced motion pattern learning from extreme-augmented contrastive samples. These results systematically validate three key insights: first, standalone extreme augmentation carries inherent risks of feature destruction; second, balanced hybrid strategies are essential for optimal performance; third, trajectory-guided attention provides unique representational advantages in skeleton-based learning.

Table 8.

Ablation study of augmentation methods. ✓ indicates module usage; best results in bold.

CA | EA | SD | OA | TBA | NTU-60 xview (%)
✓ | | | | | 84.8
| ✓ | | | | 81.3
✓ | ✓ | | | | 86.1
✓ | ✓ | ✓ | | | 86.3
✓ | ✓ | | ✓ | | 86.9
✓ | ✓ | | | ✓ | 87.8

Qualitative Comparison: Figure 6 presents t-SNE visualizations comparing our method with CMD38 and MAMP28 on 15 action classes from NTU-60. t-SNE visualization, a widely adopted qualitative evaluation tool in skeleton-based action recognition, effectively reveals the distribution structure of features in low-dimensional space by exposing intra-class compactness and inter-class separation. CMD's cross-modal distillation yields semantically driven but dispersed distributions, while MAMP enforces strict inter-cluster separation with intra-cluster compactness. Our approach shows distinct cluster formation (e.g., well-separated "giving object" (gray) and "shaking hands" (dark blue)), reflecting effective inter-class discrimination, while maintaining intra-class cohesion (e.g., tight "pick up" (yellow) clusters). Remaining challenges appear in semantically similar actions ("reading" (blue) and "phone use" (lavender)), where shared motion patterns such as bending and grasping cause overlap, suggesting the need for finer-grained feature learning.

Fig. 6. t-SNE visualization of feature embeddings.

Hyperparameter analysis

To systematically evaluate the model's sensitivity to key hyperparameters, we conducted a comprehensive parameter analysis on the NTU-60 xview dataset. All parameters were optimized through grid search following a systematic parameter scanning strategy. As shown in Table 9, with $\lambda_1$ fixed at 1.0, the optimal balance is achieved when $\lambda_2 = 0.3$ and $\lambda_3 = 30$. This configuration effectively coordinates the learning processes of the different pre-training tasks while preventing any single task from dominating training. For the motion topology masking weight $\alpha$, the model performs best at $\alpha = 0.6$, with performance degradation limited to within 0.5% across the range [0.5, 0.7], demonstrating its robustness. In the TGFD module, the optimal keep ratio is 0.7, indicating that retaining 70% of the features (dropping the 30% with the highest attention) achieves an effective balance between breaking model dependency and preserving essential semantic information. The temperature parameter $\tau$, which controls the smoothing of similarity distributions in contrastive learning, reaches its optimal value at $\tau = 0.07$, maintaining effective distinction between positive and negative samples while ensuring training stability.

Table 9.

Sensitivity Analysis of Key Hyperparameters.

Hyperparameter | Search Range | Optimal Value | Sensitive Interval
$\lambda_1$ | [1.0, 2.0] | 1.0 | [0.7, 1.3]
$\lambda_2$ | [0.1, 1.0] | 0.3 | [0.2, 0.5]
$\lambda_3$ | [10, 50] | 30 | [25, 35]
$\alpha$ | [0.3, 0.9] | 0.6 | [0.5, 0.7]
keep | [0.05, 0.9] | 0.7 | [0.65, 0.75]
$\tau$ | [0.05, 0.09] | 0.07 | [0.06, 0.09]

Limitations and future work

This work has demonstrated notable performance gains, though several limitations remain. Addressing these limitations outlines a clear path for future work:

First, the model demonstrates sensitivity to data quality and exhibits unstable performance when processing noisy data. Second, the representation learning framework shows limited capacity in modeling multi-level semantic features, making it challenging to effectively capture fine-grained motion characteristics. Third, the model's cross-domain generalization capability remains constrained by structural priors.

Our future work will focus on three primary directions: performing an in-depth qualitative analysis (by visualizing TGFD attention and MTM masks) to inform model refinement; integrating the pre-training framework with powerful backbone networks such as Transformer and GCN; and exploring a robust pre-training paradigm that incorporates fine-grained attention and adaptive topology learning, coupled with asymmetric architecture design to enhance training efficiency.

Conclusion

This study presents a self-supervised framework that synergizes motion-topology masked prediction with contrastive learning to enhance action recognition. The motion-topology masking strategy selectively focuses on high-intensity motion regions while maintaining topological constraints, enabling effective learning of discriminative motion patterns and joint dependencies. A hybrid strategy of normal and extreme augmentations is adopted to introduce novel motion patterns, increasing the diversity of contrastive learning and helping the model learn richer feature representations. To address the potential loss of original identity information caused by extreme augmentations, a trajectory-guided feature dropping module is introduced, enabling the model to learn more comprehensive features and enhancing recognition accuracy. Comprehensive ablation studies and comparative evaluations confirm the method's superior recognition performance.

Acknowledgements

This work was supported in part by the Shaanxi Provincial Natural Science Foundation under Grant 2025JC-YBMS-764 and in part by the National Natural Science Foundation of China under Grant 52302505.

Author contributions

Y.H. conceived the methodology and experiment(s); F.L. provided the resources; F.L., Y.H., and X.H. conducted the experiment(s); Y.H. and C.S. analysed the results; X.G. reviewed the results. X.G. and C.S. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Lopez-Nava, I. H. & Muñoz-Meléndez, A. Human action recognition based on low- and high-level data from wearable inertial sensors. Int. J. Distrib. Sens. Netw. 15(12), 1477–1550 (2019).
  • 2. Rodomagoulakis, I. et al. Multimodal human action recognition in assistive human-robot interaction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2702–2706 (2016).
  • 3. Lin, W., Sun, M.-T., Poovendran, R. & Zhang, Z. Human activity recognition for video surveillance. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2737–2740 (2008).
  • 4. Qi, M. et al. StagNet: An attentive semantic RNN for group activity recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 104–120 (2018).
  • 5. He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15979–15988 (2022).
  • 6. Wang, P. et al. Contrast-reconstruction representation learning for self-supervised skeleton-based action recognition. IEEE Trans. Image Process. 31, 6224–6238 (2022).
  • 7. Guo, T. et al. Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 762–770 (2022).
  • 8. Laptev, I. & Lindeberg, T. Space-time interest points. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), 432–439 (2003).
  • 9. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
  • 10. Ruiz, L., Gama, F. & Ribeiro, A. Gated graph recurrent neural networks. IEEE Trans. Signal Process. 68, 6303–6318 (2020).
  • 11. Zhang, P. et al. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1963–1978 (2019).
  • 12. Wang, H. & Wang, L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3633–3642 (2017).
  • 13. Li, S. et al. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5457–5466 (2018).
  • 14. Wang, P., Li, W., Li, C. & Hou, Y. Action recognition based on joint trajectory maps with convolutional neural networks. In Proceedings of the 24th ACM International Conference on Multimedia (ACM MM), 102–106 (2016).
  • 15. Duan, H. et al. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2959–2968 (2022).
  • 16. Caetano, C. et al. SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 1–8 (2019).
  • 17. Caetano, C., Brémond, F. & Schwartz, W. R. Skeleton image representation for 3D action recognition based on tree structure and reference joints. In Proceedings of the 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), 16–23 (2019).
  • 18. Yan, S., Xiong, Y. & Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. AAAI Conf. Artif. Intell. 32, 7444–7452 (2018).
  • 19. Li, M. et al. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3590–3598 (2019).
  • 20. Chi, H.-G. et al. InfoGCN: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022).
  • 21. Arnab, A. et al. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 6816–6826 (2021).
  • 22. Kazakos, E. et al. With a little help from my temporal context: Multimodal egocentric action recognition. In Proceedings of the British Machine Vision Conference (BMVC), 610–611 (2021).
  • 23. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 4171–4186 (2019).
  • 24. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  • 25. Bao, H., Dong, L., Piao, S. & Wei, F. BEiT: BERT pre-training of image transformers. In Proceedings of the International Conference on Learning Representations (ICLR) (2022).
  • 26. Wei, C. et al. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14668–14678 (2022).
  • 27. Tong, Z., Song, Y., Wang, J. & Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 10078–10093 (2022).
  • 28. Mao, Y. et al. Masked motion predictors are strong 3D action representation learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10147–10157 (2023).
  • 29. Wu, W. et al. SkeletonMAE: Spatial-temporal masked autoencoders for self-supervised skeleton action recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 224–229 (2023).
  • 30. Su, Y. et al. Real scene single image dehazing network with multi-prior guidance and domain transfer. IEEE Trans. Multimedia 27, 5492–5506 (2025).
  • 31. Xue, Y. et al. FMTrack: Frequency-aware interaction and multi-expert fusion for RGB-T tracking. IEEE Trans. Circuits Syst. Video Technol. (2025).
  • 32. Xue, Y. et al. Target-distractor aware UAV tracking via global agent. IEEE Trans. Intell. Transp. Syst. 26(10), 16116–16127 (2025).
  • 33. He, K. et al. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9726–9735 (2020).
  • 34. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), 1597–1607 (2020).
  • 35. Chen, Z., Liu, H., Guo, T., Chen, Z. et al. Contrastive learning from spatio-temporal mixed skeleton sequences for self-supervised skeleton-based action recognition. Preprint at https://doi.org/10.48550/arXiv.2207.03065 (2022).
  • 36. Zhang, J., Lin, L. & Liu, J. Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 3427–3435 (2023).
  • 37. Gao, X. et al. Efficient spatio-temporal contrastive learning for skeleton-based 3D action recognition. IEEE Trans. Multimedia 25, 405–417 (2021).
  • 38. Mao, Y. et al. CMD: Self-supervised 3D action representation learning with cross-modal mutual distillation. In Proceedings of the European Conference on Computer Vision (ECCV), 734–752 (2022).
  • 39. Mehraban, S., Rajabi, M. J. & Taati, B. STARS: Self-supervised tuning for 3D action recognition in skeleton sequences. Preprint at https://doi.org/10.48550/arXiv.2407.10935 (2024).
  • 40. Bui, D. C. et al. C2T-Net: Channel-aware cross-fused transformer-style networks for pedestrian attribute recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 351–358 (2024).
  • 41. Bui, D. C. et al. CLEAR: Cross-transformers with pre-trained language model for person attribute recognition and retrieval. Pattern Recognit. 164, 111486 (2025).
  • 42. Wang, N. et al. Weakly supervised image dehazing via physics-based decomposition. IEEE Trans. Multimedia 27, 5492–5506 (2025).
  • 43. Zeng, Q., Liu, C., Liu, M. & Chen, Q. Contrastive 3D human skeleton action representation learning via CrossMoCo with spatiotemporal occlusion mask data augmentation. IEEE Trans. Multimedia 25, 1564–1574 (2023).
  • 44. Oord, A. v. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://doi.org/10.48550/arXiv.1807.03748 (2018).
  • 45. Kim, B., Chang, H. J., Kim, J. & Choi, J. Y. Global-local motion transformer for unsupervised skeleton-based action learning. In Proceedings of the European Conference on Computer Vision (ECCV), 209–225 (2022).
  • 46. Yun, S. et al. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 6022–6031 (2019).
  • 47. Ren, S. et al. A simple data mixing prior for improving self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14575–14584 (2022).
  • 48. Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond empirical risk minimization. Preprint at https://doi.org/10.48550/arXiv.1710.09412 (2017).
  • 49. Shahroudy, A., Liu, J., Ng, T.-T. & Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1010–1019 (2016).
  • 50. Liu, J. et al. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2684–2701 (2020).
  • 51. Liu, Y. et al. A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans. Multimedia Comput. Commun. Appl. 16, 41:1–41:24 (2020).
  • 52. Zheng, N. et al. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2644–2651 (2018).
  • 53. Cheng, Y.-B. et al. Hierarchical transformer: Unsupervised representation learning for skeleton-based human action recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 1–6 (2021).
  • 54. Lin, L., Song, S., Yang, W. & Liu, J. MS2L: Multi-task self-supervised learning for skeleton-based action recognition. In Proceedings of the ACM International Conference on Multimedia (ACM MM), 2490–2498 (2020).
  • 55. Tanfous, A. B., Zerroug, A., Linsley, D. & Serre, T. How and what to learn: Taxonomizing self-supervised learning for 3D action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2888–2897 (2022).
  • 56. Shah, A. et al. HaLP: Hallucinating latent positives for skeleton-based self-supervised learning of actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18846–18856 (2023).
  • 57. Zhang, H., Hou, Y., Zhang, W. & Li, W. Contrastive positive mining for unsupervised 3D action representation learning. In Proceedings of the European Conference on Computer Vision (ECCV), 36–51 (2022).
  • 58. Su, K., Liu, X. & Shlizerman, E. PREDICT & CLUSTER: Unsupervised skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9628–9637 (2020).
  • 59. Li, L. et al. 3D human action representation learning via cross-view consistency pursuit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4739–4748 (2021).
  • 60. Zhou, Y. et al. Self-supervised action representation learning from partial spatio-temporal skeleton sequences. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 3825–3833 (2023).
  • 61. Hua, Y. et al. Part aware contrastive learning for self-supervised action recognition. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 855–863 (2023).
  • 62. Franco, L., Mandica, P., Munjal, B. & Galasso, F. Hyperbolic self-paced learning for self-supervised skeleton-based action representations. In Proceedings of the International Conference on Learning Representations (ICLR) (2023).
  • 63. Zhu, Y., Han, H., Yu, Z. & Liu, G. Modeling the relative visual tempo for self-supervised skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 13867–13876 (2023).
  • 64. Yang, D. et al. ViA: View-invariant skeleton action representation learning via motion retargeting. Int. J. Comput. Vis. 132, 2351–2366 (2023).
  • 65. Zhou, Y. et al. Spatio-temporal gated graph attention network for skeleton-based action recognition. IEEE Trans. Image Process. 34, 10257–10271 (2024).
  • 66. Yan, H. et al. SkeletonMAE: Graph-based masked autoencoder for skeleton sequence pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5606–5618 (2023).
  • 67. Wu, W. et al. Frequency guidance matters: Skeletal action recognition by frequency-aware mixed transformer. In Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM), 4660–4669 (2024).
  • 68. Zhou, Y. et al. BlockGCN: Redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2049–2058 (2024).
