Abstract
This study proposes a unified multimodal temporal motion state perception framework for optical imaging-oriented biomedical applications, integrating visual skeleton sequences, inertial measurement unit (IMU) signals, and surface electromyography (EMG) signals. The framework utilizes modality-specific encoders and a cross-modal temporal alignment attention mechanism to explicitly model temporal offsets from heterogeneous sensing streams. A multimodal temporal Transformer backbone is introduced to capture long-range motion dependencies and cross-modal interactions, while an uncertainty-aware fusion module dynamically allocates weights based on modality confidence. Experimental results demonstrate that the proposed approach achieves an accuracy of 94.37%, an F1-score of 93.95%, and a mean average precision of 96.02%, outperforming mainstream baseline models. Robustness evaluations further confirm stable performance under visual occlusion and sensor noise. These results indicate that the framework provides a highly accurate and robust solution for rehabilitation assessment, sports training monitoring, and wearable intelligent interaction systems.
Keywords: temporal synchronization, kinematic-physiological fusion, uncertainty-aware weighting, wearable sensor integration, pose estimation
1. Introduction
The primary objective of this study is to develop a unified AI-driven multimodal framework that achieves robust human motion monitoring by resolving the inherent challenges of data heterogeneity and temporal misalignment. Human motion state perception represents a critical research direction in intelligent sensing and human–computer interaction, demonstrating extensive application value in motion analysis, rehabilitation training assessment [1,2], and intelligent medical monitoring [3,4,5]. However, the precise modeling of these states through objective, automated approaches requires bridging the gap between high-level semantic recognition and low-level biomechanical signal integration. Therefore, rather than focusing solely on generic perception models, this work directs its efforts toward the architectural synchronization of visual and wearable streams to establish a reliable foundation for smart healthcare systems [6,7].
Traditional human motion analysis methods have evolved from classical state estimation to sophisticated data-driven frameworks [8,9]. Early sensor fusion techniques, such as those based on Kalman filters and their variants, were widely employed to achieve real-time state estimation and noise reduction by fusing linear or near-linear kinematic signals [10]. While Kalman filters provide a mathematically rigorous approach for sensor fusion, they often encounter limitations when handling high-dimensional, non-linear human skeletal sequences and the stochastic nature of electromyography (EMG) signals. Consequently, the field has transitioned toward deep learning paradigms. Vision-based methods primarily perform human detection and pose estimation through video imagery [11,12], utilizing convolutional neural networks (CNNs) and graph convolutional networks (GCNs) to model skeletal topology [13,14,15]. However, these methods remain vulnerable to occlusion and lack observability of internal biomechanical dynamics [16,17,18,19].
To address these limitations, wearable sensor-based methods (IMU/EMG) provide continuous signals with high temporal resolution [20,21]. IMUs capture limb dynamics, while EMG signals reveal neuromuscular control mechanisms [22,23]. While RNNs and TCNs have strengthened temporal modeling [24,25], wearable sensors struggle to represent spatial structural relationships [26,27,28]. Consequently, multimodal motion analysis has emerged as a research hotspot, aiming to fuse visual information with sensor signals for a complementary representation [29,30]. Modern architectures, particularly Transformer models, have demonstrated outstanding performance in capturing cross-modal interactions [31,32,33]. Recent frameworks like Husformer [34] and MTFT [35] have attained significant gains in recognition accuracy [36,37]. A comprehensive assessment of current state-of-the-art (SOTA) methods regarding their fusion strategies and handling of frequency heterogeneity is summarized in Table 1.
Table 1.
Summary of representative heterogeneous data fusion methods and their handling of recording frequencies.
| Method/Paradigm | Fused Modalities | Fusion Technique | Frequency Heterogeneity Handling |
|---|---|---|---|
| Classical Estimation | IMU + Vision | Kalman Filter/EKF | Linear interpolation or state prediction |
| Recurrent Modeling | IMU + EMG | BiLSTM/GRU | Zero-padding or fixed-window segmentation |
| Graph-based Fusion | Skeleton + IMU | Adaptive GCN | Feature-level concatenation after resampling |
| Attention-based | Vision + Sensor | Standard Transformer | Pre-processing via spline interpolation |
| Proposed UADMF | Vision + IMU + EMG | Temporal Alignment Attention | Dynamic learnable offset modeling and asynchronous matching |
Despite these advancements, critical research gaps remain. First, inconsistent sampling rates across modalities make temporal synchronization difficult. Second, visual errors and sensor noise often become coupled and amplified during fusion. Third, existing fusion strategies lack the capability to model modality uncertainty under varying data quality [38,39,40]. Therefore, achieving cross-modal alignment within a unified temporal semantic space has become the core issue [41].
To fill these gaps, this study proposes a unified multimodal temporal Transformer framework. We adopt unified temporal modeling as the core paradigm and design a cross-modal temporal alignment attention mechanism to dynamically match heterogeneous data with different sampling rates and temporal offsets. Furthermore, we introduce an uncertainty-aware fusion strategy that estimates modality reliability and adaptively allocates weights. The main contributions are:
We propose a unified multimodal Transformer framework for joint representation learning of vision-based skeletons, IMU signals, and EMG signals, bridging the gap between high-level spatial structure and low-level biomechanical dynamics.
We develop a cross-modal temporal alignment attention mechanism utilizing learnable offsets to resolve asynchronous timing discrepancies inherent in heterogeneous data streams with significantly different sampling rates.
We introduce an uncertainty-aware fusion strategy that dynamically allocates modality weights based on predictive reliability, ensuring system robustness under conditions of visual occlusion or sensor noise.
We establish a rigorous validation benchmark on a 9216-sample dataset, providing empirical evidence that the framework maintains superior stability and synchronization accuracy in complex, real-world motion scenarios.
2. Materials and Method
2.1. Data Collection
In this study, a multimodal human motion state perception dataset was constructed through a self-established experimental acquisition platform. The data sources comprised three modalities: visual video, inertial measurement unit signals, and surface electromyography signals. Together, these modalities collaboratively characterize human motion states from the perspectives of spatial structure, kinematic variation, and muscle activation mechanisms, as shown in Table 2. To address the significant discrepancy in sampling frequencies—ranging from 30 Hz for video to 1000 Hz for EMG—a unified hardware synchronization strategy was implemented. A central synchronization controller (master clock) was utilized to issue a simultaneous start-trigger pulse to the camera system, the IMU network, and the EMG acquisition device at the beginning of each session. This hardware-level triggering ensured that all modality streams shared a common global time zero.
Table 2.
Summary of the multimodal human motion dataset collected in this study.
| Modality | Sensor/Source | Sampling Rate | Data Volume |
|---|---|---|---|
| RGB Video | Multi-view Cameras | 30 fps | 9216 video clips |
| Skeleton Sequences | Pose Estimation (derived) | 30 fps | 9216 sequences |
| IMU Signals | Wrist/Ankle/Waist IMUs | 100 Hz | 27,648 signal streams |
| EMG Signals | Surface EMG (multi-channel) | 1000 Hz | 18,432 signal streams |
| Annotated Motion Labels | Manual + Semi-auto | – | 12 action classes |
The data acquisition campaign was conducted in multiple phases from March 2024 to November 2024. Experimental sites were arranged in both a standard indoor motion laboratory and a rehabilitation training assessment center. Experimental tasks included continuous action recognition, motion phase segmentation, and posture stability evaluation. Visual data were collected using a multi-view RGB camera system at 30 fps. For temporal alignment during multimodal fusion, the maximum temporal synchronization error across all sensors was measured to be less than 1 ms, which is significantly finer than the 33.3 ms interval of the video frames. In the post-processing stage, to reconcile the heterogeneous sampling rates, skeletal keypoints extracted at 30 Hz and IMU data at 100 Hz were upsampled to a unified 1000 Hz temporal grid using cubic spline interpolation, aligned to the high-resolution EMG timestamps.
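The upsampling step described above can be sketched in a few lines. This is an illustrative NumPy/SciPy fragment rather than the acquisition pipeline's actual code; the `upsample_to_grid` helper and the 17-joint example stream are hypothetical.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def upsample_to_grid(signal, src_rate, dst_rate):
    """Upsample a (T, C) signal from src_rate to dst_rate with cubic splines."""
    t_src = np.arange(signal.shape[0]) / src_rate       # source timestamps (s)
    t_dst = np.arange(0.0, t_src[-1], 1.0 / dst_rate)   # dense target grid (s)
    return CubicSpline(t_src, signal, axis=0)(t_dst)

# e.g. 2 s of a hypothetical 17-joint coordinate stream at 30 Hz -> 1000 Hz grid
skel = np.random.randn(60, 17)
skel_1k = upsample_to_grid(skel, 30, 1000)
```

The spline passes exactly through the original samples, so values on the dense grid agree with the raw 30 Hz data at every source timestamp.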
IMU data were collected through wearable nodes at 100 Hz. Raw signals were recorded with timestamps anchored to the master clock to ensure cross-modal temporal consistency. EMG data were acquired using a multi-channel system at 1000 Hz. To reduce noise, skin surfaces were cleansed, and band-pass filtering was applied. By maintaining a sub-millisecond synchronization accuracy, the framework ensures that rapid muscle activation spikes in the EMG signals are precisely correlated with the corresponding kinematic variations and spatial posture changes, effectively mitigating the alignment artifacts often encountered in high-frequency heterogeneous data fusion.
All participants were healthy adult volunteers, totaling 32 individuals, with a balanced gender ratio and an age distribution ranging from 22 to 35 years. The demographic characteristics of the participants, including sex, height, and weight distribution, are summarized in Table 3.
Table 3.
Demographic characteristics of the participants (n = 32).
| Parameter | Mean ± SD/Range | Unit |
|---|---|---|
| Age | 28.5 ± 4.2/22–35 | years |
| Sex (Male/Female) | 16/16 | - |
| Height | 171.2 ± 7.4/155–186 | cm |
| Weight | 66.4 ± 10.8/48–92 | kg |
Each participant completed 3–5 acquisition rounds for all motion sequences, with each round lasting approximately 6–10 min, resulting in more than 85 h of effective recorded data. A three-level aligned annotation strategy integrating video, skeleton, and sensor streams was adopted. Action categories and motion phases were first annotated at the video frame level, followed by temporal mapping to corresponding IMU and EMG signal segments. A unified temporal semantic labeling system was thereby established, providing high-quality supervisory information for subsequent multimodal joint modeling.
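The temporal mapping from frame-level annotations to sensor segments can be illustrated as follows; `map_frame_labels` is a hypothetical helper that assumes labels are piecewise constant over each video frame.

```python
import numpy as np

def map_frame_labels(frame_labels, video_fps, sensor_rate, n_samples):
    """Assign each sensor sample the label of the video frame covering its timestamp."""
    t = np.arange(n_samples) / sensor_rate                       # sensor timestamps (s)
    idx = np.minimum((t * video_fps).astype(int), len(frame_labels) - 1)
    return np.asarray(frame_labels)[idx]

labels_30hz = [0] * 30 + [1] * 30                  # 2 s of per-frame labels at 30 fps
emg_labels = map_frame_labels(labels_30hz, 30, 1000, 2000)   # map onto the 1 kHz EMG grid
```

Because all streams share the hardware-triggered time zero, this index arithmetic alone suffices; no per-stream offset correction is needed at the labeling stage.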
2.2. Data Preprocessing and Augmentation Strategy
In the multimodal temporal human motion state perception framework, data preprocessing and data augmentation—a strategy used to artificially increase the diversity and volume of the training set by applying various transformations to existing samples—are not merely engineering procedures for performance improvement but also critical theoretical processes that ensure multi-source signals are comparable and fusion-compatible within a unified semantic space. In alignment with the primary objective of this study to establish a robust and high-precision monitoring framework, these steps ensure that discrepancies across modalities in terms of sampling mechanisms, physical meanings, noise distributions, and scale ranges are addressed. Without unified normalization and temporal consistency processing, subsequent cross-modal attention modeling would struggle to learn stable mapping relationships. Therefore, preprocessing is conducted from both visual and sensor modalities, combined with temporal alignment and resampling strategies, to construct unified multimodal temporal representations.
During the video data preprocessing stage, human detection algorithms are first employed to localize target regions. Let the raw video frame be denoted as $I_t \in \mathbb{R}^{H \times W \times 3}$, where $t$ represents the temporal index and $H$ and $W$ denote image height and width, respectively. Through the human detection model, a bounding box is obtained, enabling the cropped human-region image to be extracted. Subsequently, a pose estimation network (specifically, the HRNet-w32 architecture) is applied to extract keypoint coordinates. The skeletal keypoint set of the $t$-th frame is denoted as $P_t = \{p_{t,j}\}_{j=1}^{J}$, where $J$ represents the number of joints. To eliminate scale variations caused by inter-subject differences and camera distance changes, keypoint normalization is required. A common approach adopts the keypoint centroid as the reference origin and the torso length as the reference scale, formulated as
$$\hat{p}_{t,j} = \frac{p_{t,j} - c_t}{s_t}, \qquad c_t = \left(c_t^{x},\, c_t^{y}\right) = \frac{1}{J} \sum_{j=1}^{J} p_{t,j} \tag{1}$$
where $c_t^{x}$ and $c_t^{y}$ denote the centroid coordinates of keypoints in the current frame, and $s_t$ represents the scale factor, such as shoulder width or torso length. Through this normalization operation, different video sequences are mapped into a unified spatial scale, reducing spatial distribution discrepancies that may interfere with model training. Each frame skeleton is ultimately represented as $\hat{P}_t = \{\hat{p}_{t,j}\}_{j=1}^{J}$, and a temporal skeleton sequence $S = (\hat{P}_1, \ldots, \hat{P}_T)$ is constructed as the visual modality input.
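The normalization in Eq. (1) amounts to centering each frame's keypoints at their centroid and dividing by a scale factor. A minimal NumPy sketch follows; since the exact torso definition is not specified here, the centroid-to-farthest-joint distance is used as a stand-in scale.

```python
import numpy as np

def normalize_keypoints(P):
    """Centroid-center and scale-normalize a (J, 2) keypoint array, per Eq. (1)."""
    c = P.mean(axis=0)                              # keypoint centroid (c_x, c_y)
    # stand-in scale: distance from centroid to the farthest joint
    s = np.linalg.norm(P - c, axis=1).max() + 1e-8  # epsilon guards degenerate frames
    return (P - c) / s

P = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])  # toy 3-joint frame
P_hat = normalize_keypoints(P)
```

After normalization the keypoints are zero-centered and the farthest joint lies at unit distance, so sequences from different subjects and camera distances share one spatial scale.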
2.3. Proposed Method
2.3.1. Overall Architecture
After temporal alignment and feature preparation of multimodal data, an end-to-end processing pipeline is constructed at the model design level, consisting of modality-specific encoding, cross-modal temporal alignment, unified temporal modeling, uncertainty-aware fusion, and task prediction. Let the aligned visual skeleton sequence be $X^{V} \in \mathbb{R}^{T \times J \times 2}$, where $J$ is the number of joints. Similarly, let the IMU sequence be $X^{I} \in \mathbb{R}^{T \times C_I}$ and the EMG sequence be $X^{E} \in \mathbb{R}^{T \times C_E}$, where $C_I$ and $C_E$ represent the number of inertial and electromyography channels, respectively. These sequences share a unified temporal index $t$.
The three modalities are first processed by dedicated modality-specific encoders to project these heterogeneous features into a shared latent space with a unified embedding dimension $d$. This identical dimensionality ensures that features from different sources can directly interact and be compared within the subsequent attention-based modules. The skeleton encoder extracts spatial structural information and local motion patterns from joint coordinate sequences, producing $H^{V} \in \mathbb{R}^{T \times d}$. The IMU encoder refines temporal kinematic characteristics from acceleration and angular velocity sequences, yielding $H^{I} \in \mathbb{R}^{T \times d}$. The EMG encoder captures muscle activation intensity and rhythmic dynamics, generating $H^{E} \in \mathbb{R}^{T \times d}$.
To explicitly address potential fine-grained temporal delays and semantic misalignment across modalities, a cross-modal temporal alignment attention module is introduced. Anchored on visual features at each time step $t$, the most relevant temporal context is adaptively selected from neighboring sensor segments, resulting in softly aligned representations $\tilde{H}^{I}$ and $\tilde{H}^{E}$, which are combined with $H^{V}$ to form temporally consistent joint representations. The fused features are then fed into a multimodal temporal Transformer backbone, where long-range intra-modal dependencies and cross-modal interactions are jointly modeled within a shared attention space, producing high-level temporal semantic representations $Z$. During the fusion stage, an uncertainty-aware module estimates modality reliability at each time step and dynamically assigns weights to aggregate multimodal features, suppressing noisy modalities while enhancing reliable ones. A robust global motion state representation $\bar{z}$ is thereby obtained. Finally, $\bar{z}$ is passed to task-specific prediction heads to generate action categories, motion phases, or stability scores, enabling stable and generalizable human motion state perception under complex motion scenarios.
2.3.2. Cross-Modal Temporal Alignment Attention Module
The cross-modal temporal alignment attention module is positioned before the unified temporal modeling backbone, aiming to explicitly mitigate implicit temporal misalignment between visual skeleton sequences and wearable sensor sequences caused by sampling discrepancies, neuromuscular response latency, and device clock drift. Unlike conventional self-attention mechanisms that model temporal dependencies within a single modality, the proposed module adopts cross-modal and cross-temporal neighborhood attention as the fundamental computational unit. The visual modality serves as a temporal anchor, while sensor modalities are softly matched within local temporal windows to establish explicit alignment in a unified semantic space.
As shown in Figure 1, the encoded visual representation is denoted as $H^{V} \in \mathbb{R}^{T \times C}$, and the encoded IMU and EMG representations are denoted as $H^{m} \in \mathbb{R}^{T \times C}$, $m \in \{I, E\}$, where $T$ represents temporal length, and $C$ denotes channel dimensionality. The module adopts a hierarchical stacked architecture composed of linear projection sublayers, relative temporal position encoding sublayers, and multi-head cross-modal attention sublayers. Through parallel attention heads, temporal offset patterns and cross-modal semantic correlations are learned within different subspaces, enhancing alignment resolution and modeling robustness. For a visual feature $h_t^{V}$ at time step $t$, a query vector $q_t$ is generated via linear projection, while sensor features within a local temporal neighborhood $\mathcal{N}(t)$ centered at $t$ are projected to keys and values. The cross-modal temporal alignment attention is computed as follows:
$$q_t = W_Q\, h_t^{V} \tag{2}$$

$$k_{\tau}^{m} = W_K\, h_{\tau}^{m}, \qquad v_{\tau}^{m} = W_V\, h_{\tau}^{m}, \qquad \tau \in \mathcal{N}(t) \tag{3}$$

$$\tilde{h}_t^{m} = \sum_{\tau \in \mathcal{N}(t)} \operatorname{softmax}_{\tau}\!\left(\frac{q_t^{\top} k_{\tau}^{m}}{\sqrt{d}}\right) v_{\tau}^{m} \tag{4}$$
where $m \in \{I, E\}$ denotes the sensor modality and $\mathcal{N}(t)$ represents the temporal neighborhood set. Compared with standard self-attention,

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V, \tag{5}$$
the proposed mechanism decouples query and key-value sources across modalities and restricts attention computation to a local temporal domain. Consequently, the objective of attention learning shifts from generic sequence dependency modeling to explicit temporal offset estimation. Mathematically, this process can be interpreted as learnable dynamic time warping along the temporal axis, where the attention distribution functions as an implicit probability density over temporal alignment positions. To further enhance temporal compensation capability, a learnable temporal offset parameter is introduced and incorporated via relative positional encoding into the attention calculation:
$$\alpha_{t,\tau}^{m} = \operatorname{softmax}_{\tau}\!\left(\frac{q_t^{\top} k_{\tau}^{m} + \phi\!\left(\tau - t - \delta^{m}\right)}{\sqrt{d}}\right) \tag{6}$$

$$\tilde{h}_t^{m} = \sum_{\tau \in \mathcal{N}(t)} \alpha_{t,\tau}^{m}\, v_{\tau}^{m} \tag{7}$$
where $\delta^{m}$ denotes the learnable temporal offset and $\phi(\cdot)$ represents the relative temporal encoding term. According to the expectation alignment property of attention, when the distribution satisfies
$$\mathbb{E}_{\tau \sim \alpha_{t,\cdot}^{m}}[\tau] = \sum_{\tau \in \mathcal{N}(t)} \alpha_{t,\tau}^{m}\, \tau = t + \delta^{m}, \tag{8}$$
the aligned output achieves optimal temporal matching under the mean squared error criterion, thereby minimizing cross-modal semantic misalignment. After stacked alignment layers, temporally consistent features $\tilde{H}^{I}$ and $\tilde{H}^{E}$ are obtained and integrated with visual representations to provide precisely aligned inputs for subsequent global dependency modeling.
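The neighborhood-restricted cross-modal attention of Eqs. (2)-(7) can be sketched in NumPy as follows. This single-head, loop-based version is purely illustrative; the `phi` bias array stands in for the learnable relative offset term, and all weight matrices are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aligned_attention(h_vis, h_sen, Wq, Wk, Wv, window=5, phi=None):
    """Soft-align a sensor stream to the visual anchor within local windows.

    h_vis, h_sen: (T, C) visual anchor / sensor features.
    phi: optional (2*window+1,) relative temporal bias (offset compensation)."""
    T, d = h_vis.shape[0], Wq.shape[1]
    out = np.zeros((T, Wv.shape[1]))
    for t in range(T):
        lo, hi = max(0, t - window), min(h_sen.shape[0], t + window + 1)
        q = h_vis[t] @ Wq                       # query from the visual anchor
        K = h_sen[lo:hi] @ Wk                   # keys/values from the sensor window
        V = h_sen[lo:hi] @ Wv
        logits = K @ q / np.sqrt(d)
        if phi is not None:                     # relative-position offset bias
            logits += phi[(np.arange(lo, hi) - t) + window]
        out[t] = softmax(logits) @ V            # convex combination of values
    return out

rng = np.random.default_rng(0)
h_vis, h_sen = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
I4 = np.eye(4)
aligned = aligned_attention(h_vis, h_sen, I4, I4, I4, window=2)
```

Because each output row is a convex combination of value vectors, the aligned features stay bounded by the sensor features they attend over, mirroring the boundedness argument used later for the backbone.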
Figure 1.
Schematic illustration of the cross-modal temporal alignment attention module, depicting dynamic soft alignment between visual skeleton and sensor sequences within local temporal neighborhoods via cross-modal attention.
2.3.3. Multimodal Temporal Transformer Representation Learning Module
The multimodal temporal Transformer representation learning module operates on the aligned multimodal features and performs deep temporal dependency modeling and cross-modal interaction learning within a unified semantic space.
As illustrated in Figure 2, the aligned modality representations are denoted as $H^{V}, \tilde{H}^{I}, \tilde{H}^{E} \in \mathbb{R}^{T \times d}$. These features are interleaved along the temporal dimension such that tokens from different modalities at the same time step become locally adjacent, forming a joint sequence $U \in \mathbb{R}^{3T \times d}$. The backbone adopts a 4-layer Transformer encoder architecture to facilitate deep hierarchical feature abstraction. Each encoder layer is structured with a multi-head self-attention (MHSA) mechanism consisting of 8 parallel attention heads, a hidden dimensionality of $d = 256$, and a position-wise feed-forward network (FFN) with an inner-layer dimension of 1024.
Figure 2.
Schematic illustration of the multimodal temporal Transformer representation learning module, showing cross-modal interaction and long-range dependency modeling of aligned multimodal temporal features within a unified attention space.
Each layer consists of pre-layer normalization (Pre-LN), multi-head cross-modal self-attention, and FFN sublayers with residual connections:
$$U' = U + \operatorname{MHSA}\!\big(\operatorname{LN}(U)\big) \tag{9}$$

$$U'' = U' + \operatorname{FFN}\!\big(\operatorname{LN}(U')\big) \tag{10}$$
To explicitly capture both intra-modal temporal evolution and inter-modal coupling, a modality relation bias $b^{\mathrm{mod}}$ and a relative temporal bias $b^{\mathrm{rel}}$ are incorporated into the attention weight computation:
$$A_{ij} = \frac{(W_Q u_i)^{\top} (W_K u_j)}{\sqrt{d}} + b^{\mathrm{mod}}_{m(i),\, m(j)} + b^{\mathrm{rel}}_{t(i) - t(j)} \tag{11}$$

$$\alpha_{ij} = \operatorname{softmax}_{j}\!\big(A_{ij}\big) \tag{12}$$

where $m(i)$ and $t(i)$ denote the modality and time step of token $u_i$, respectively.
To prevent over-smoothing and enhance training stability, the GELU activation function and a dropout rate of 0.1 are applied within each FFN block. Because attention weights constitute a probability distribution across rows, each output vector is a convex combination of input value vectors, satisfying
$$o_i = \sum_{j} \alpha_{ij}\, v_j, \qquad \sum_{j} \alpha_{ij} = 1, \quad \alpha_{ij} \ge 0, \tag{13}$$
which ensures bounded aggregation and prevents uncontrolled noise amplification. Through stacked layers, any two temporal steps or modality tokens become connected via attention pathways, establishing a global receptive field in the representation space. The final sequence is reorganized by temporal aggregation to produce $Z \in \mathbb{R}^{T \times d}$, encoding posture structure variation, kinematic rhythm, and muscle activation dynamics.
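The token interleaving and one Pre-LN encoder layer (Eqs. (9)-(10)) can be sketched in NumPy as follows; single-head attention is used for brevity, and all weight matrices are illustrative stand-ins rather than the trained parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def pre_ln_layer(U, Wq, Wk, Wv, W1, W2):
    """One Pre-LN layer: U' = U + MHSA(LN(U)); U'' = U' + FFN(LN(U'))."""
    X = layer_norm(U)
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1]))  # rows sum to 1
    U1 = U + A @ (X @ Wv)                                      # attention residual
    Y = layer_norm(U1)
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return U1 + gelu(Y @ W1) @ W2                              # FFN residual

def interleave(hv, hi, he):
    """Interleave per-time-step modality tokens into a (3T, d) joint sequence."""
    return np.stack([hv, hi, he], axis=1).reshape(-1, hv.shape[1])

rng = np.random.default_rng(1)
hv, hi, he = (rng.normal(size=(4, 8)) for _ in range(3))
Wq, Wk, Wv = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))
W1, W2 = 0.1 * rng.normal(size=(8, 32)), 0.1 * rng.normal(size=(32, 8))
U = interleave(hv, hi, he)
out = pre_ln_layer(U, Wq, Wk, Wv, W1, W2)
```

Interleaving places the three modality tokens of each time step next to one another, so even local attention patterns can mix modalities before long-range dependencies are modeled.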
2.3.4. Uncertainty-Aware Fusion Module
The uncertainty-aware fusion module operates after multimodal temporal Transformer representation learning and performs reliability estimation and adaptive weighted aggregation of the high-level modality-specific representations $Z^{m} = \{z_t^{m}\}_{t=1}^{T}$, $m \in \{V, I, E\}$. The module follows a sequential structure comprising feature recalibration, uncertainty estimation, dynamic weight generation, and weighted fusion.
As illustrated in Figure 3, each modality feature $z_t^{m}$ is first processed by a modality-specific uncertainty encoder $f^{m}_{\theta}$ consisting of two time-distributed perceptron layers with nonlinear activation. The temporal energy distribution of modality features is formulated as
$$e_t^{m} = \big\|z_t^{m}\big\|_2^{2} \tag{14}$$
and the corresponding uncertainty measure is defined as
$$u_t^{m} = \operatorname{softplus}\!\big(f^{m}_{\theta}(z_t^{m})\big) \tag{15}$$
Fusion weights are generated through normalized inverse confidence:
$$w_t^{m} = \frac{1 / u_t^{m}}{\sum_{m'} 1 / u_t^{m'}} \tag{16}$$
and the fused representation is computed as
$$\bar{z}_t = \sum_{m} w_t^{m}\, z_t^{m} \tag{17}$$
Assuming each modality provides an unbiased estimate of a latent representation $z_t^{*}$ with noise variance proportional to $u_t^{m}$, the expected squared error becomes
$$\mathbb{E}\!\left[\big\|\bar{z}_t - z_t^{*}\big\|^{2}\right] = \sum_{m} \big(w_t^{m}\big)^{2}\, u_t^{m}, \tag{18}$$
which is minimized when weights are proportional to inverse uncertainty, consistent with the proposed formulation. A temporal smoothness regularization term
$$\mathcal{L}_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \sum_{m} \big(w_{t+1}^{m} - w_t^{m}\big)^{2} \tag{19}$$
is further introduced to maintain temporal consistency. Through adaptive confidence-based collaboration, modality contributions are dynamically adjusted according to scene conditions, ensuring robust and semantically consistent representations for downstream motion state prediction tasks.
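The inverse-uncertainty weighting and fusion steps (Eqs. (16)-(17)) reduce to a few array operations; below is a sketch that assumes per-time-step uncertainty estimates are already available, with a hypothetical helper name.

```python
import numpy as np

def uncertainty_fuse(Z, U):
    """Fuse modality features by normalized inverse uncertainty.

    Z: (M, T, d) per-modality features; U: (M, T) positive uncertainty estimates.
    Returns the fused (T, d) sequence and the (M, T) weights."""
    inv = 1.0 / U
    W = inv / inv.sum(axis=0, keepdims=True)        # weights sum to 1 over modalities
    return np.einsum('mt,mtd->td', W, Z), W

Z = np.random.default_rng(0).normal(size=(3, 5, 4))  # M=3 modalities, T=5, d=4
fused, W = uncertainty_fuse(Z, np.ones((3, 5)))      # equal uncertainty everywhere
```

With equal uncertainties the weights collapse to a plain average; as one modality's uncertainty grows, its weight shrinks smoothly toward zero, which is the noise-suppression behavior described above.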
Figure 3.
Schematic illustration of the uncertainty-aware fusion module, demonstrating adaptive confidence-based weighting and integration of multimodal features to enhance overall robustness.
2.4. Experimental Setup
The experimental hardware platform was established based on a high-performance workstation equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), an Intel Xeon multi-core CPU, and 128 GB DDR4 memory. The software environment was built on Ubuntu 22.04 using Python 3.10 and the PyTorch 2.x deep learning framework. Data preprocessing was performed using NumPy 2.3.0 and SciPy 1.14.0 for signal filtering, and OpenCV 4.13 for video frame parsing. Specifically, all IMU and EMG signals underwent Z-score normalization and were processed through a Butterworth band-pass filter to eliminate motion artifacts, while skeletal coordinates were normalized relative to the root joint to ensure spatial consistency.
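The sensor preprocessing described above can be sketched as follows. The 20-450 Hz pass band and 4th-order filter are typical surface-EMG choices assumed here for illustration, not values reported by the study, and the helper name is hypothetical.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_emg(x, fs=1000.0, band=(20.0, 450.0), order=4):
    """Band-pass filter and Z-score normalize one raw EMG channel sampled at fs Hz."""
    # Butterworth band-pass with cutoffs normalized to the Nyquist frequency
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype='band')
    x = filtfilt(b, a, x)                  # zero-phase filtering (no temporal shift)
    return (x - x.mean()) / (x.std() + 1e-8)

x = np.random.default_rng(0).normal(size=2000)   # 2 s of synthetic raw EMG
y = preprocess_emg(x)
```

`filtfilt` is used rather than a causal filter so that filtering introduces no phase delay, which matters when signals must stay aligned to the shared master clock.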
To provide a thorough and precise description of the model output, the proposed framework employs a multi-task prediction layer. For action recognition, the output is a probability distribution $\hat{y} \in \mathbb{R}^{K}$ generated via a Softmax function, where $K = 12$ represents the number of action categories. For motion phase prediction, the model outputs a sequence of frame-level phase labels $\{\hat{s}_t\}_{t=1}^{T}$, enabling precise segmentation of movement stages. For posture stability evaluation, the model generates a continuous scalar score $\hat{r} \in (0, 1)$ through a Sigmoid activation, where a higher value indicates superior balance and rhythmic consistency. These outputs are jointly optimized to provide a comprehensive characterization of the user’s motion state.
In response to the requirement to clearly define the data used for training and evaluation, we implemented a strict subject-independent partition protocol. The dataset, comprising 9216 multi-view samples, was divided based on participant identity to ensure that the model is evaluated on entirely unseen subjects. Specifically, the training set consists of data from 70% of the participants, used for model weight optimization via backpropagation. The validation set comprises data from 10% of the participants, employed exclusively for hyperparameter tuning and early stopping to prevent overfitting. The test set (or evaluation set) consists of the remaining 20% of the participants, which is held out and used only for the final performance assessment. This clear separation ensures that no temporal samples from a participant in the training set appear during the evaluation phase, thereby providing a rigorous measure of the system’s generalization capability to new individuals.
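A subject-independent split of this kind operates on participant identities rather than individual samples; a minimal sketch follows (the 70/10/20 ratios match the protocol above, while the helper name and seed are hypothetical).

```python
import random

def subject_split(subject_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Partition subjects (not samples) into train/val/test identity sets."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)        # deterministic shuffle for reproducibility
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = subject_split(range(32))   # 32 participants -> 22 / 3 / 7 subjects
```

All recordings of a participant are then routed to that participant's partition, so no subject ever appears in both training and evaluation.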
To ensure the technical precision of the proposed model, the detailed architectural specifications are summarized in Table 4. For the training iterations, a five-fold cross-validation strategy was adopted on the training and validation subjects, and the final results are reported as the average across the five runs. Upon acceptance of the manuscript, the complete source code will be made publicly available at https://github.com/Aurelius-04/UADMF.git (accessed on 6 April 2026).
Table 4.
Architectural specifications and hyperparameter configurations of the proposed framework.
| Module | Parameter Name | Value/Configuration |
|---|---|---|
| Skeleton Encoder | Architecture | Temporal Transformer |
| Layer/Attention Heads | 4 Layers/8 Heads | |
| Hidden/FFN Dimension | 256/1024 | |
| IMU Encoder | Architecture | Bidirectional LSTM |
| Hidden Layers/Size | 2 Layers/128 units | |
| EMG Encoder | Architecture | Temporal Convolutional Network (TCN) |
| Dilation Factors | [1, 2, 4, 8] | |
| Alignment & Fusion | Cross-attention Heads | 4 Heads |
| Modality Dropout (p) | 0.2 | |
| Fusion Mechanism | Uncertainty-aware Weighting | |
| Training Protocol | Optimizer/Weight Decay | AdamW/ |
| Learning Rate Schedule | Cosine Annealing () | |
| Batch Size/Epochs | 32/150 | |
| Dropout Rate | 0.1 |
2.5. Baseline Models and Evaluation Metrics
In the selection of baseline models, representative approaches are comprehensively covered across three mainstream technical paradigms, including unimodal modeling, conventional multimodal fusion, and deep multimodal temporal learning, thereby establishing a systematic comparative framework. Among them, ST-GCN [42] relies on the topological structure of the human skeleton to model spatial dependencies among joints, enabling effective characterization of motion structural variations. CTR-GCN [43] further improves this by implementing a channel-wise topology refinement mechanism, allowing the model to learn dynamic joint correlations tailored to different feature channels for enhanced spatial discrimination. BiLSTM [44] performs dynamic modeling of IMU temporal signals through a bidirectional gated recurrent mechanism, capturing contextual information from both past and future time steps. Limu-BERT [45] introduces a Transformer-based representation learning approach for inertial signals, leveraging self-attention to capture deep temporal dependencies and robust features from IMU data. TCN [46] employs dilated convolutional structures to model multi-scale temporal patterns in EMG signals, facilitating efficient extraction of muscle activation rhythms while enhancing long-sequence modeling efficiency. Early Fusion (Concat + MLP) [47] realizes joint representation of multi-source information through feature-level concatenation, providing a unified input space for cross-modal collaboration. Late Fusion (Weighted Average) [48] integrates multimodal prediction outputs at the decision level via weighted aggregation, offering strong structural flexibility. Multimodal Transformer [49] leverages global self-attention mechanisms to model cross-modal interaction relationships and long-range temporal dependencies within a unified semantic space, delivering powerful joint representation capability for complex motion state perception.
Accuracy was adopted to measure overall classification correctness, F1-score was used to evaluate the balance between precision and recall, and mAP was employed to assess overall ranking capability in multi-class detection or phase recognition tasks. In addition, a stability metric [50] was introduced to characterize temporal smoothness and fluctuation of predictions along the time dimension, enabling comprehensive evaluation of multimodal temporal model performance. The mathematical formulations of the evaluation metrics are defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{21}$$

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{22}$$

$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} AP_c, \qquad AP_c = \int_{0}^{1} P_c(r)\, dr \tag{23}$$

$$\mathrm{Stability} = 1 - \frac{1}{T-1} \sum_{t=2}^{T} \big\|\hat{y}_t - \hat{y}_{t-1}\big\| \tag{24}$$
Here, $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively. $P_c(r)$ represents precision at recall level $r$. $C$ denotes the number of classes, and $AP_c$ represents the average precision of class $c$. $T$ denotes the temporal length, $\hat{y}_t$ represents the predicted output at time step $t$, and $\|\cdot\|$ denotes the vector norm.
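The scalar metrics above translate directly into code. Below is a sketch of accuracy, F1 (from the confusion counts), and the temporal stability measure; function names are chosen for illustration.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Overall classification correctness from confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def stability(y):
    """Temporal stability: 1 minus the mean step-to-step prediction change.

    y: (T, d) sequence of predicted outputs."""
    y = np.asarray(y, dtype=float)
    return 1.0 - np.mean(np.linalg.norm(np.diff(y, axis=0), axis=-1))
```

A perfectly constant prediction sequence yields a stability of exactly 1, and jittery frame-to-frame predictions push the score downward, which is the behavior the stability column in the results tables is meant to capture.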
3. Results and Discussion
3.1. Baseline Comparison and Computational Efficiency Analysis
The purpose of this set of experiments is to systematically validate the performance, robustness, and cost-efficiency advantages of the proposed framework. To ensure a fair and comprehensive assessment, all baseline models and the proposed framework were trained and evaluated on a unified training database consisting of 9216 multimodal temporal sequences, providing a rigorous performance-to-cost evaluation.
As shown in Table 5, while the proposed UADMF framework involves a larger parameter count (16.2 M) and higher inference latency (38.5 ms) due to its triple-modality encoders and alignment module, it achieves a superior balance between complexity and robustness. Specifically, the framework yields a significant gain in temporal Stability (0.958) and F1@25 (91.22%) compared to high-performance unimodal models like CTR-GCN. The average inference time of 38.5 ms remains well within the real-time processing threshold (approximately 50 ms), ensuring its feasibility for clinical deployment. Furthermore, the training process on the 9216-sample database demonstrates that the framework effectively leverages multimodal redundancy to suppress noise and jitter without requiring an excessively large database for convergence. In summary, the results confirm that the proposed framework offers a high-performance, robust, and cost-effective solution for motion state perception in complex biomedical environments.
Table 5.
Comprehensive comparison with baseline models (Mean ± SD). Performance is evaluated across five-fold cross-validation.
| Method | Accuracy (↑) | F1-Score (↑) | F1@25 (↑) | mAP (↑) | Stability (↑) | Params (M) | Time (ms) |
|---|---|---|---|---|---|---|---|
| ST-GCN (Skeleton) | 3.1 | ||||||
| CTR-GCN (Skeleton) | 1.46 | ||||||
| BiLSTM (IMU) | 2.5 | ||||||
| Limu-BERT (IMU) | 12.1 | ||||||
| TCN (EMG) | 0.85 | ||||||
| Early Fusion | 5.2 | ||||||
| Late Fusion | 5.1 | ||||||
| Multi-Transformer | 14.8 | ||||||
| Proposed (Full) | 16.2 |
3.2. Robustness Evaluation
The purpose of this experiment is to systematically evaluate the robustness of the proposed multimodal temporal motion perception framework under complex interference and information deficiency conditions. To ensure the reproducibility of these evaluations, we strictly quantified the interference parameters. For “Visual Occlusion,” we simulated partial perception by randomly masking 30% to 50% of the skeletal joints in each frame. “Viewpoint Variation” was implemented by applying random spatial rotations to the skeletal coordinates to simulate perspective shifts. “Sensor Noise” was modeled by injecting additive white Gaussian noise (AWGN) into the raw IMU and EMG streams, maintaining a signal-to-noise ratio (SNR) of 20 dB. “Modality Absence” refers to a complete signal loss (100% dropout) for the specified modality during the entire inference sequence. Unlike the primary experiment, which focuses on recognition accuracy under ideal conditions, this evaluation verifies the system’s reliability in practical deployment.
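The perturbation settings above can be reproduced with simple NumPy routines. The following sketch is illustrative only: the function names, the fixed seed, and the zero-filling convention for occluded joints and absent modalities are assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (assumed)

def occlude_joints(skeleton, rate=0.4):
    """Visual occlusion: randomly mask a fraction of joints per frame.

    skeleton: (T, J, 3) array of joint coordinates; masked joints are
    zeroed here (a modeling assumption, not the paper's stated convention).
    """
    T, J, _ = skeleton.shape
    mask = rng.random((T, J)) < rate          # True = occluded joint
    out = skeleton.copy()
    out[mask] = 0.0
    return out

def add_awgn(signal, snr_db=20.0):
    """Sensor noise: inject additive white Gaussian noise at a target SNR (dB)."""
    power = np.mean(signal ** 2)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def drop_modality(batch, key):
    """Modality absence: replace one stream with zeros for the whole sequence."""
    out = dict(batch)
    out[key] = np.zeros_like(batch[key])
    return out
```

Injecting noise relative to signal power (rather than with a fixed variance) keeps the 20 dB SNR condition comparable across IMU and EMG channels of different amplitude ranges.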
As shown in Table 6 and Figure 4, the model achieves peak performance under normal conditions. When various perturbations are introduced, all metrics decrease to varying degrees, yet the overall degradation remains controlled. The quantitative definition of these perturbations allows for a more rigorous assessment: even with a 50% joint occlusion rate or a significant 20 dB noise level, the accuracy remains above 91%, which underscores the effective error-suppression capability of our uncertainty-aware fusion module. The most pronounced decline occurs under visual occlusion, reflecting the importance of skeletal structural information. Performance degradation under viewpoint variation is comparatively limited, indicating stable structural modeling. The smallest decline appears under sensor noise, suggesting that the fusion stage exhibits strong tolerance to signal quality fluctuations. Modality-missing experiments further reveal differences in multimodal contributions. The largest performance drop occurs when the visual modality is removed, indicating its irreplaceable role in spatial structural constraints. From a mechanistic perspective, the robustness observed under these quantified noise and occlusion levels arises from the redundancy in multimodal attention. When a modality is corrupted, its contribution is automatically attenuated, while other modalities compensate through attention-based information propagation, thereby maintaining stable decision boundaries. This dual mechanism of cross-modal redundancy and adaptive confidence modulation enables consistently high performance, confirming the environmental adaptability of the proposed framework.
Table 6.
Robustness evaluation of the proposed model under challenging conditions (Mean ± SD across five-fold cross-validation).
| Setting | Accuracy (↑) | F1-Score (↑) | mAP (↑) | Stability (↑) |
|---|---|---|---|---|
| Normal condition | ||||
| Visual occlusion (30–50% joints) | ||||
| Viewpoint variation (rotational shift) | | | | |
| Sensor noise (SNR = 20 dB) | ||||
| Missing Vision modality (100% loss) | ||||
| Missing IMU modality (100% loss) | ||||
| Missing EMG modality (100% loss) |
Figure 4.
Robustness performance trends of the proposed model under occlusion, noise, viewpoint variation, and modality-missing conditions.
3.3. Ablation Study
The ablation experiments are designed to systematically validate the independent contribution and cooperative effect of key components in the proposed framework, with a specific focus on the reliability and robustness of the temporal alignment mechanism. By progressively removing or replacing core modules, performance variations are analyzed to demonstrate the architectural rationality. The variants include: (1) the full model; (2) w/o Cross-Modal Temporal Alignment Attention, which replaces the alignment module with standard linear interpolation; (3) w/o Temporal Alignment, which directly concatenates heterogeneous features without any synchronization or offset compensation; (4) w/o Multimodal Temporal Transformer; (5) w/o Uncertainty-Aware Fusion; and (6) w/o Modality Dropout.
As shown in Table 7, the full model achieves optimal performance across all metrics. The gain in accuracy is primarily attributed to the Cross-Modal Temporal Alignment Attention module. Comparing the full model to the w/o Cross-Modal Alignment variant, we observe a 1.56% drop in accuracy. This highlights that simply upsampling to a unified 1000 Hz grid using spline interpolation is insufficient to bridge the gap between 30 Hz skeletal sequences and 1000 Hz EMG signals. The learnable attention mechanism is the specific element that enables the model to resolve the sub-millisecond synchronization errors and neuromuscular latency, ensuring that kinematic variations are precisely correlated with muscle activation spikes. The system’s robustness is significantly enhanced by the Uncertainty-Aware Fusion module. When this module is removed, the accuracy decreases from 94.37% to 91.93%, and the Stability score drops to 0.926. This module is critical because it dynamically estimates modality reliability and adaptively allocates weights. In practical scenarios involving visual occlusion or sensor noise, this module suppresses unreliable streams and enhances stable ones, thereby preventing error propagation that would otherwise degrade the decision boundaries. Furthermore, the Multimodal Temporal Transformer backbone serves as the foundation for temporal smoothness; by capturing long-range dependencies and intra-modal interactions, it ensures a high Stability score of 0.958, effectively mitigating prediction flickering that occurs in non-aligned or non-Transformer-based architectures. In summary, the synergistic integration of these architectural elements provides a physiologically grounded invariant for robust motion perception under complex, asynchronous acquisition conditions.
Table 7.
Ablation study of key components in the proposed framework (Mean ± SD across five-fold cross-validation), validating the necessity and reliability of temporal alignment.
| Variant | Accuracy (↑) | F1-Score (↑) | F1@25 (↑) | mAP (↑) | Stability (↑) |
|---|---|---|---|---|---|
| Full model | 94.37 | 93.95 | 91.22 | 96.02 | 0.958 |
| w/o Cross-Modal Alignment Attention | |||||
| w/o Temporal Alignment (Direct) | |||||
| w/o Multimodal Transformer | |||||
| w/o Uncertainty-Aware Fusion | 91.93 | | | | 0.926 |
| w/o Modality Dropout |
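The uncertainty-aware fusion whose ablation is reported above can be sketched as a confidence-weighted combination of modality features. This is a minimal illustration under assumed conventions (softmax normalization of scalar confidences, a single fused vector); the function name and shapes are hypothetical, not the paper's exact formulation.

```python
import numpy as np

def uncertainty_weighted_fusion(features, confidences):
    """Fuse per-modality features using softmax-normalized confidence scores.

    features:    list of (D,) feature vectors, one per modality.
    confidences: list of scalar reliability estimates (higher = more reliable),
                 e.g. produced by a small head on each modality encoder.
    A corrupted modality with low confidence is automatically attenuated,
    while reliable modalities compensate -- the behavior the ablation
    attributes to the uncertainty-aware fusion module.
    """
    c = np.asarray(confidences, dtype=float)
    w = np.exp(c - c.max())        # numerically stable softmax
    w /= w.sum()                   # weights sum to 1
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused, w
```

Removing this weighting (uniform averaging) is precisely the "w/o Uncertainty-Aware Fusion" variant: a noisy stream then contributes at full strength, which is consistent with the observed drop in accuracy and Stability.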
3.4. Generalizability Validation on Public Dataset
To further verify the generalization capability of the proposed framework and ensure the architecture is not overfitted to the self-collected dataset, we conducted extensive evaluations on the UTD-MHAD public benchmark. UTD-MHAD is a widely recognized multimodal dataset comprising synchronized skeletal sequences and inertial measurement unit (IMU) signals. By adopting this dataset, we aim to demonstrate that the core components of our framework—specifically the modality-specific encoders, the cross-modal temporal alignment, and the uncertainty-aware fusion—maintain their superior performance across different hardware configurations and acquisition environments. For this experiment, all baseline models mentioned in our comparative study were re-implemented and evaluated under a subject-independent protocol to ensure a fair and rigorous comparison.
As shown in Table 8, the proposed framework consistently outperforms both unimodal and multimodal baselines. Among unimodal methods, CTR-GCN and Limu-BERT show significant improvements over traditional ST-GCN and BiLSTM, confirming the efficacy of dynamic topology refinement and Transformer-based temporal modeling. However, their performance remains lower than that of the fusion-based methods, highlighting the necessity of cross-modal collaboration. While Early and Late Fusion strategies provide baseline improvements, they are unable to effectively model the complex temporal offsets present in heterogeneous data. The Multimodal Transformer achieves higher accuracy by capturing long-range dependencies, yet it still lags behind our complete framework. The full proposed model achieves the highest accuracy of 92.15% and a stability score of 0.932, proving that the integration of temporal alignment and uncertainty-aware weighting effectively mitigates the performance degradation caused by cross-domain distribution shifts. These results confirm that our framework provides a robust and scalable solution for human motion monitoring that is applicable beyond the specific conditions of the original training environment.
Table 8.
Generalizability performance comparison on the UTD-MHAD public dataset (Mean ± SD across five-fold cross-validation).
| Method | Accuracy (↑) | F1-Score (↑) | mAP (↑) | Stability (↑) |
|---|---|---|---|---|
| ST-GCN (Skeleton only) | ||||
| CTR-GCN (Skeleton only) | ||||
| BiLSTM (IMU only) | ||||
| Limu-BERT (IMU only) | ||||
| Early Fusion (Concat + MLP) | ||||
| Late Fusion (Weighted Average) | ||||
| Multimodal Transformer | ||||
| Proposed (Full Model) | 92.15 | | | 0.932 |
3.5. Discussion
The proposed AI-driven multimodal framework represents a significant advancement over extant motion monitoring paradigms by establishing a unified semantic space for vision-based skeleton sequences, inertial rhythms, and muscle activation intensity. Compared to state-of-the-art methods such as Husformer and MTFT [34,35], which primarily focus on high-level feature fusion, our UADMF framework explicitly addresses the underlying synchronization issues inherent in heterogeneous sensing. Our results demonstrate a clear performance advantage over the strongest unimodal baseline, CTR-GCN [43], with a gain of approximately 5% in accuracy (94.37% vs. 89.45%) and a substantial improvement in temporal stability (0.958 vs. 0.898). This accuracy gain is critical for biomedical applications; for example, the 5% improvement translates to a significantly higher sensitivity in detecting neuromuscular compensatory patterns, where identifying the exact millisecond-level discrepancy between muscle firing (EMG) and limb displacement (Skeleton/IMU) is essential for clinical diagnosis.
The gains in accuracy and robustness are directly attributable to specific architectural innovations. The cross-modal temporal alignment attention module resolves the semantic displacement caused by the massive frequency discrepancy between 30 Hz video and 1000 Hz EMG signals. Furthermore, the uncertainty-aware fusion module provides the necessary error-suppression to maintain performance above 91% even when subjected to 50% joint occlusion or 20 dB sensor noise, a level of resilience not observed in traditional early or late fusion techniques. In terms of computational cost, the framework achieves a balance between complexity and real-time feasibility. While the model utilizes 16.2 M parameters—more than unimodal TCN or BiLSTM architectures—it maintains an inference latency of 38.5 ms, which is well within the 50 ms threshold required for real-time biomedical monitoring. During the training phase, the model demonstrated efficient convergence on the 9216-sample database. On a high-performance workstation equipped with an NVIDIA RTX 4090 GPU, the framework reaches convergence within 150 epochs, requiring approximately 5 h of total training time. This indicates that the architecture effectively leverages multimodal redundancy to learn generalized features without requiring excessively large datasets.
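As a point of reference for the frequency discrepancy discussed above, the non-learnable alignment baseline (resampling a 30 Hz skeleton-derived stream onto a 1000 Hz EMG timeline) can be sketched as follows. Linear rather than spline interpolation and all names here are illustrative assumptions; the learnable alignment attention itself is not reproduced.

```python
import numpy as np

def resample_to_grid(stream, fs_src, fs_tgt, duration):
    """Interpolate a 1-D stream sampled at fs_src onto an fs_tgt time grid.

    Values beyond the last source sample are clamped to the edge value
    (np.interp's default behavior).
    """
    t_src = np.arange(int(duration * fs_src)) / fs_src
    t_tgt = np.arange(int(duration * fs_tgt)) / fs_tgt
    return np.interp(t_tgt, t_src, stream[: len(t_src)])

# A 2 s, 1 Hz oscillation sampled at 30 Hz (stand-in for a skeleton feature),
# upsampled onto the 1000 Hz EMG timeline: 60 samples become 2000.
skel_feat = np.sin(2.0 * np.pi * 1.0 * np.arange(60) / 30.0)
aligned = resample_to_grid(skel_feat, fs_src=30, fs_tgt=1000, duration=2.0)
```

Such fixed resampling equalizes the grids but cannot compensate the variable neuromuscular latency between muscle firing and limb displacement, which is why the paper introduces a learnable cross-modal temporal alignment attention on top of the unified grid.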
Regarding the relevance and transferability of our findings, the dataset utilized in this study—comprising fundamental movements such as walking, squatting, and lunging—represents a comprehensive set of biomechanical primitives that are common to a wide range of physical activities. By modeling the intricate relationships between skeletal kinematics, joint angular velocities, and neuromuscular activation during these representative tasks, the framework captures generalized motion features rather than task-specific patterns. This indicates a high degree of transferability to more complex athletic or clinical activities, such as stair climbing or sports-specific strength training, which share the same underlying flexion-extension and multi-joint coordination principles. Furthermore, the inclusion of diverse observational conditions, such as viewpoint variations and partial occlusions within our database, ensures that the learned multimodal representations are robust to environmental shifts. The framework’s ability to maintain high stability across these fundamental primitives suggests that the structure-rhythm-force coupling acts as a physiologically grounded invariant, providing a reliable foundation for monitoring unobserved physical activities in real-world biomedical applications.
3.6. Limitation and Future Work
Although stable recognition performance is achieved in complex motion scenarios, several aspects require further investigation. A primary limitation of this study lies in the scale and demographic diversity of the dataset, which is currently restricted to 32 healthy adult volunteers within a narrow age range of 22 to 35 years. While this cohort provides a baseline for technical validation, caution is warranted before extending claims of clinical or biomedical applicability, as the results may not generalize to elderly populations or patients with significant musculoskeletal impairments. Furthermore, the dataset was primarily collected in controlled indoor environments; despite incorporating occlusion, noise, and viewpoint perturbations to enhance diversity, distribution discrepancies remain compared with large-scale outdoor sports, clinical rehabilitation wards, or crowded interactive environments. Generalization capability under broader open-world conditions therefore requires cross-scenario validation. Additionally, multimodal synchronization currently relies on unified acquisition systems; in practical deployment, clock drift and communication latency among heterogeneous devices may introduce more complex temporal misalignment, demanding higher real-time adaptability from the alignment mechanism. Long-term wearing comfort and stability of IMU and EMG sensors may also influence continuous monitoring quality, indicating the need for optimization in sensor layout and lightweight system integration. Future research may focus on multi-center collaborative data collection, weakly supervised or self-supervised temporal representation learning, edge-side real-time inference acceleration, and low-power wearable system integration to enhance deployability and long-term operational reliability in real-world complex environments.
4. Conclusions
This study addresses the challenge that human motion states are difficult to perceive stably in complex motion scenarios, where multi-source sensing data are characterized by temporal asynchrony, semantic misalignment, and quality fluctuations that hinder collaborative modeling. To address these issues, a unified multimodal temporal perception framework is constructed. By introducing modality-specific encoding mechanisms, deep feature representations of skeletal structures, inertial motion dynamics, and muscle activation signals are effectively extracted. On this basis, a cross-modal temporal alignment attention module is designed to explicitly alleviate temporal offsets induced by heterogeneous sampling mechanisms and physiological response delays. Furthermore, a multimodal temporal Transformer representation learning network is developed to model long-range motion dependencies and cross-modal interaction relationships within a unified attention space. In addition, an uncertainty-aware fusion mechanism is incorporated to dynamically allocate modality weights according to modality confidence, thereby suppressing noise interference while reinforcing reliable information contributions. Through this multi-level architectural design, systematic collaboration is achieved across temporal consistency modeling, semantic co-representation, and fusion reliability regulation. Comprehensive experimental results demonstrate that the proposed method achieves significant performance advantages on the multimodal human motion state perception dataset, with overall Accuracy reaching 94.37%, F1-score reaching 93.95%, and mAP reaching 96.02%, while attaining an optimal Stability score of 0.958. Comparative experiments verify that the complete framework consistently outperforms single-modality models, conventional fusion strategies, and non-aligned multimodal Transformer architectures.
Robustness evaluations further indicate that only limited performance degradation is observed under visual occlusion, viewpoint variation, sensor noise, and modality-missing conditions. Ablation studies additionally confirm that the cross-modal temporal alignment module, the temporal Transformer backbone, and the uncertainty-aware fusion mechanism each contribute critically to overall performance improvement. Overall, high-accuracy, high-stability, and strongly generalized unified modeling is achieved under complex motion environments, providing reliable technical support for human motion state perception in practical scenarios such as rehabilitation assessment, athletic training, and intelligent human–machine interaction. Furthermore, large-scale deployment of the framework in rehabilitation and sports training may reduce labor-intensive assessment costs, improve resource allocation efficiency, and provide data-driven economic value for the digital transformation of the sports health industry.
Author Contributions
Conceptualization, Q.C., X.W., R.C. and Y.Z.; Data curation, S.H. and Y.L.; Formal analysis, S.L.; Funding acquisition, Y.Z.; Investigation, S.L.; Methodology, Q.C., X.W. and R.C.; Project administration, Y.Z.; Resources, S.H. and Y.L.; Software, Q.C., X.W. and R.C.; Supervision, Y.Z.; Validation, S.L.; Visualization, S.H. and Y.L.; Writing—original draft, Q.C., X.W., R.C., S.H., Y.L., S.L. and Y.Z.; Q.C., X.W. and R.C. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of China Agricultural University (protocol code 2025079; date of approval: 13 July 2025).
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This research was funded by the National Natural Science Foundation of China, grant number 61202479.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
References
- 1.Xie Y., Yang L., Zhang M., Chen S., Li J. A review of multimodal interaction in remote education: Technologies, applications, and challenges. Appl. Sci. 2025;15:3937. doi: 10.3390/app15073937. [DOI] [Google Scholar]
- 2.Su Y., Wang C., Chen J., Zhang Y., Chen G., Dong C., Guan H. High-strength and tough PPy organohydrogel with electromagnetic shielding in extreme environments and underwater human motion monitoring. Mater. Today Chem. 2026;52:103400. doi: 10.1016/j.mtchem.2026.103400. [DOI] [Google Scholar]
- 3.Xue Y., Yu Y., Yin K., Li P., Xie S., Ju Z. Human in-hand motion recognition based on multi-modal perception information fusion. IEEE Sens. J. 2022;22:6793–6805. doi: 10.1109/JSEN.2022.3148992. [DOI] [Google Scholar]
- 4.Zhang Y., Tian J., Xiong Q. A review of embodied intelligence systems: A three-layer framework integrating multimodal perception, world modeling, and structured strategies. Front. Robot. AI. 2025;12:1668910. doi: 10.3389/frobt.2025.1668910. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Zhang Q., Zheng W., Cai L., Wan J., Liang Y. Trehalose-toughened transparent hydrogel with fast-healing and anti-freezing capability for precision human motion detection. Appl. Mater. Today. 2026;48:103003. [Google Scholar]
- 6.Patraucean V., Smaira L., Gupta A., Recasens A., Markeeva L., Banarse D., Koppula S., Malinowski M., Yang Y., Doersch C., et al. Perception test: A diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 2023;36:42748–42761. [Google Scholar]
- 7.Wang K., Lv J., Ma Q., Cheng D., Wang S., Ran J. Liquid Metal/Polyurethane Core-sheath Coaxial Conductive Filaments for Human Motion Detection and Thermal Management. J. Sci. Adv. Mater. Devices. 2026:101114. doi: 10.1016/j.jsamd.2026.101114. [DOI] [Google Scholar]
- 8.Wu Z., Zheng J., Ren X., Vasluianu F.A., Ma C., Paudel D.P., Van Gool L., Timofte R. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; Piscataway, NJ, USA: 2024. Single-model and any-modality for video object tracking; pp. 19156–19166. [Google Scholar]
- 9.Ji C., Yang J., Wang X., Wang Y., Li Y., Liang X., Zhao J., Cao L. Mechano-Adaptive Anti-Swelling Hydrogels: Stable Motion-Interfacing and Healthcare Monitoring. Adv. Mater. Technol. 2026:e02048. doi: 10.1002/admt.202502048. [DOI] [Google Scholar]
- 10.Grewal M.S. International Encyclopedia of Statistical Science. Springer; Berlin/Heidelberg, Germany: 2025. Kalman filtering; pp. 1285–1289. [Google Scholar]
- 11.Chen J., Zhou Z., Kim B.J., Zhou Y., Wang Z., Wan T., Yan J., Kang J., Ahn J.H., Chai Y. Optoelectronic graded neurons for bioinspired in-sensor motion perception. Nat. Nanotechnol. 2023;18:882–888. doi: 10.1038/s41565-023-01379-2. [DOI] [PubMed] [Google Scholar]
- 12.Xu W., Zhou G., Zhou Y., Zou Z., Wang J., Wu W., Li X. A vision-based tactile sensing system for multimodal contact information perception via neural network. IEEE Trans. Instrum. Meas. 2024;73:5026411. doi: 10.1109/TIM.2024.3428647. [DOI] [Google Scholar]
- 13.Hussain B., Guo J., Sidra F., Fang B., Chen L., Uddin S. Enhancing spatial awareness via multi-modal fusion of cnn-based visual and depth features. Int. J. Ethical AI Appl. 2025;1:13–27. [Google Scholar]
- 14.Zhang L., Zhang Y., Ma X. Proceedings of the ICMLCA 2021, 2nd International Conference on Machine Learning and Computer Application. VDE; Hamburg, Germany: 2021. A new strategy for tuning ReLUs: Self-adaptive linear units (SALUs) pp. 1–8. [Google Scholar]
- 15.Li C., Liang W., Yin F., Zhao Y., Zhang Z. Semantic information guided multimodal skeleton-based action recognition. Inf. Fusion. 2025;123:103289. doi: 10.1016/j.inffus.2025.103289. [DOI] [Google Scholar]
- 16.Suo X., Tang W., Li Z. Motion capture technology in sports scenarios: A survey. Sensors. 2024;24:2947. doi: 10.3390/s24092947. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Patruno C., Renò V., Cicirelli G., D’Orazio T. Multimodal people re-identification using 3D skeleton, depth, and color information. IEEE Access. 2024;12:174689–174704. doi: 10.1109/access.2024.3504738. [DOI] [Google Scholar]
- 18.Cabaraux P., Mongold S., Georgiev C., Carlak E.Y., Garbusinski J., Naeije G., Vander Ghinst M., Bourguignon M. The confusing role of visual motion detection acuity in postural stability in young and older adults. Gait Posture. 2025;119:63–69. doi: 10.1016/j.gaitpost.2025.02.027. [DOI] [PubMed] [Google Scholar]
- 19.Happee R., Kotian V., De Winkel K.N. Neck stabilization through sensory integration of vestibular and visual motion cues. Front. Neurol. 2023;14:1266345. doi: 10.3389/fneur.2023.1266345. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wang Y., Cheng X., Jabban L., Sui X., Zhang D. Motion intention prediction and joint trajectories generation toward lower limb prostheses using EMG and IMU signals. IEEE Sens. J. 2022;22:10719–10729. [Google Scholar]
- 21.Zhang W., Zhang C., Gao Y., Jin Z. KineticsSense: A Multimodal Wearable Sensor Framework for Modeling Lower-Limb Motion Kinetics. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2025;9:1–35. doi: 10.1145/3749462. [DOI] [Google Scholar]
- 22.Purohit P., LaCourse J.R. Multimodal EMG–IMU Sensor Fusion with Dual-Output LSTM for Fatigue Estimation During Neonatal Chest Compressions. Biomed. Eng. Adv. 2026;11:100209. [Google Scholar]
- 23.Khalil Y.H., Mouftah H.T. LiCaNet: Further enhancement of joint perception and motion prediction based on multi-modal fusion. IEEE Open J. Intell. Transp. Syst. 2022;3:222–235. doi: 10.1109/OJITS.2022.3160888. [DOI] [Google Scholar]
- 24.Li Y., Zhang J. SL-GCNN: A graph convolutional neural network for granular human motion recognition. IEEE Access. 2024;13:12373–12387. doi: 10.1109/ACCESS.2024.3514082. [DOI] [Google Scholar]
- 25.Wang J., Li X., Cui Y., Mai K., Wang Y., Song M., Wang C., Yi Z., Wu X. Soft sensor-based deep temporal-graph convolutional network for applications in human motion tracking. IEEE Sens. J. 2024;24:23117–23128. doi: 10.1109/JSEN.2024.3401678. [DOI] [Google Scholar]
- 26.Ni J., Tang H., Haque S.T., Yan Y., Ngu A.H. A survey on multimodal wearable sensor-based human action recognition. arXiv. 2024. arXiv:2404.15349. doi: 10.48550/arXiv.2404.15349. [DOI] [Google Scholar]
- 27.Huang X., Xue Y., Ren S., Wang F. Sensor-based wearable systems for monitoring human motion and posture: A review. Sensors. 2023;23:9047. doi: 10.3390/s23229047. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Ram K.S., Hoon P.J., Yeon H.J. A Hybrid Noise Reduction And Normalization Framework For Improving Multimodal Sensor Data Quality In Real-Time Systems. J. Technol. Inform. Eng. 2025;4:350–368. doi: 10.51903/jtie.v4i3.440. [DOI] [Google Scholar]
- 29.Duan J., Zhuang L., Zhang Q., Zhou Y., Qin J. Multimodal perception-fusion-control and human–robot collaboration in manufacturing: A review. Int. J. Adv. Manuf. Technol. 2024;132:1071–1093. doi: 10.1007/s00170-024-13385-2. [DOI] [Google Scholar]
- 30.Sun S., Liu D., Dong J., Qu X., Gao J., Yang X., Wang X., Wang M. Proceedings of the 31st ACM International Conference on Multimedia. ACM; New York, NY, USA: 2023. Unified multi-modal unsupervised representation learning for skeleton-based action understanding; pp. 2973–2984. [Google Scholar]
- 31.Zhou X., Chen S., Ren Y., Zhang Y., Fu J., Fan D., Lin J., Wang Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics. 2022;11:911. doi: 10.3390/electronics11060911. [DOI] [Google Scholar]
- 32.Gao D., Zhou L., Ji L., Zhu L., Yang Y., Shou M.Z. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; Piscataway, NJ, USA: 2023. Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering; pp. 14773–14783. [Google Scholar]
- 33.Tang J., Hu J., Huang W., Shen S., Pan J., Wang D., Ding Y. Spatio-Temporal Graph Convolution Transformer for Video Question Answering. IEEE Access. 2024;12:131664–131680. doi: 10.1109/ACCESS.2024.3445636. [DOI] [Google Scholar]
- 34.Wang R., Jo W., Zhao D., Wang W., Gupte A., Yang B., Chen G., Min B.C. Husformer: A multimodal transformer for multimodal human state recognition. IEEE Trans. Cogn. Dev. Syst. 2024;16:1374–1390. doi: 10.1109/TCDS.2024.3357618. [DOI] [Google Scholar]
- 35.Song K., Wu Y., Xiong H., Zhu Q., Gu F., Fan J. MTFT: Multimodal Temporal Fusion Transformer for 3D Human Pose and Shape Estimation. Signal Image Video Process. 2025;19:1132. doi: 10.1007/s11760-025-04741-0. [DOI] [Google Scholar]
- 36.Wen B. A multimodal transformer framework with biomechanical constraints for injury prediction and human motion analysis. J. Comput. Methods Sci. Eng. 2025:14727978251348632. doi: 10.1177/14727978251348632. [DOI] [Google Scholar]
- 37.Xia M., Wang J., Liu N., Xie Y., Yue Z., Shi C., Chen W., Mou X. Multimodal Spatiotemporal Feature-Based Human Motion Pattern Recognition With CNN-Transformer-Attention Framework. IEEE Internet Things J. 2025;12:43883–43895. doi: 10.1109/JIOT.2025.3599403. [DOI] [Google Scholar]
- 38.Li S., Tang H. Multimodal alignment and fusion: A survey. arXiv. 2024. arXiv:2411.17040. doi: 10.1007/s11263-025-02667-1. [DOI] [Google Scholar]
- 39.Huang K., Shi B., Li X., Li X., Huang S., Li Y. Multi-modal sensor fusion for auto driving perception: A survey. arXiv. 2022. arXiv:2202.02703. doi: 10.48550/arXiv.2202.02703. [DOI] [Google Scholar]
- 40.Lv Y., Liu Z., Chang X. Uncertainty-Aware Audio-Visual Segmentation With Dynamic Fusion for Multimodal Alignment. IEEE Trans. Multimed. 2026:1–14. doi: 10.1109/TMM.2026.3651123. [DOI] [Google Scholar]
- 41.Park Y., Woo S., Lee S., Nugroho M.A., Kim C. Cross-modal alignment and translation for missing modality action recognition. Comput. Vis. Image Underst. 2023;236:103805. doi: 10.1016/j.cviu.2023.103805. [DOI] [Google Scholar]
- 42.Wang Q., Zhang K., Asghar M.A. Skeleton-based ST-GCN for human action recognition with extended skeleton graph and partitioning strategy. IEEE Access. 2022;10:41403–41410. doi: 10.1109/ACCESS.2022.3164711. [DOI] [Google Scholar]
- 43.Chen Y., Zhang Z., Yuan C., Li B., Deng Y., Hu W. Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE; Piscataway, NJ, USA: 2021. Channel-wise topology refinement graph convolution for skeleton-based action recognition; pp. 13359–13368. [Google Scholar]
- 44.Siami-Namini S., Tavakoli N., Namin A.S. Proceedings of the 2019 IEEE International Conference on Big Data (Big Data) IEEE; Piscataway, NJ, USA: 2019. The performance of LSTM and BiLSTM in forecasting time series; pp. 3285–3292. [Google Scholar]
- 45.Xu H., Zhou P., Tan R., Li M., Shen G. Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. ACM; New York, NY, USA: 2021. Limu-bert: Unleashing the potential of unlabeled data for imu sensing applications; pp. 220–233. [Google Scholar]
- 46.Hewage P., Behera A., Trovati M., Pereira E., Ghahremani M., Palmieri F., Liu Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020;24:16453–16482. doi: 10.1007/s00500-020-04954-0. [DOI] [Google Scholar]
- 47.Snoek C.G., Worring M., Smeulders A.W. Proceedings of the 13th Annual ACM International Conference on Multimedia. ACM; New York, NY, USA: 2005. Early versus late fusion in semantic video analysis; pp. 399–402. [Google Scholar]
- 48.Gadzicki K., Khamsehashari R., Zetzsche C. Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION) IEEE; Piscataway, NJ, USA: 2020. Early vs late fusion in multimodal convolutional neural networks; pp. 1–6. [Google Scholar]
- 49.Tsai Y.H.H., Bai S., Liang P.P., Kolter J.Z., Morency L.P., Salakhutdinov R. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; Stroudsburg, PA, USA: 2019. Multimodal transformer for unaligned multimodal language sequences; pp. 6558–6569. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.De Geest R., Gavves E., Ghodrati A., Li Z., Snoek C., Tuytelaars T. Proceedings of the European Conference on Computer Vision. Springer; Berlin/Heidelberg, Germany: 2016. Online action detection; pp. 269–284. [Google Scholar]