Abstract
Human Activity Recognition (HAR) plays a significant role in the field of health monitoring. By accurately identifying common activities such as walking, walking upstairs/downstairs, sitting, standing, and lying down, continuous tracking and analysis of an individual's behavioral state can be achieved, which is of great importance for health monitoring and intelligent healthcare. However, due to the noise in data collected by wearable sensors and the variability in data distribution, existing HAR methods still face limitations in accuracy and generalization, making it difficult to maintain stable and reliable early-warning performance across diverse motion scenarios and individual differences. To address these challenges, a human activity recognition method, ASSAFormer, is proposed in this paper. ASSAFormer integrates mode decomposition, heuristic optimization, and an improved Transformer for health monitoring. During data preprocessing, variational mode decomposition (VMD) is employed to filter noise from sensor data sequences, while the Whale Optimization Algorithm (WOA) optimizes the number of decomposition modes and the penalty factor, thereby mitigating the parameter sensitivity of mode decomposition. In terms of network architecture, Adaptive Sparse Self-Attention (ASSA) and Contrastive Normalization (ContraNorm) are introduced into the vanilla Transformer. First, because the self-attention mechanism in the Transformer is prone to introducing weakly correlated interference and overfitting, an Adaptive Sparse Self-Attention (ASSA) mechanism is proposed: its Sparse Self-Attention (SSA) branch filters query-key matching scores so that only highly relevant information passes through, reducing noise interference, while its Dense Self-Attention (DSA) branch retains weakly relevant yet useful information that might otherwise be discarded by excessive sparsification. Second, Contrastive Normalization (ContraNorm) is introduced to alleviate dimensional collapse, enabling better implicit dispersion of representations in the feature space. Comparative experiments demonstrate that the proposed method achieves the best performance on both the UCI and URFD datasets, and ablation studies further validate the effectiveness of the improved modules.
Keywords: Deep learning, Human activity recognition, ASSA, ContraNorm
Subject terms: Engineering, Mathematics and computing
Introduction
Human Activity Recognition (HAR) is an important research direction in the field of artificial intelligence, with broad applications in intelligent health monitoring, behavior-assisted diagnosis and treatment, smart home interaction, sports training analysis, and other scenarios. For example, by recognizing behaviors such as falls and abnormal gait in the elderly in real time, HAR can support efficient telemedicine and elderly-care systems. In smart sports, accurate identification of exercise type and intensity can be used for scientific training and fatigue warnings. To achieve automatic recognition of human activities, researchers have proposed various data-driven methods, with particular attention to wearable sensor-based approaches. Compared to video surveillance and visual recognition technologies, wearable sensors offer better privacy protection, lower energy consumption, and stronger environmental robustness, making them suitable for activity recognition in complex lighting, occluded, or unattended scenarios. Consequently, many studies explore different types of HAR methods using wearable sensor data; the mainstream models include CNNs, LSTMs, and Transformers, all of which have demonstrated promising results.
With the widespread adoption of wearable devices and mobile terminals, Accelerometer-based Human Activity Recognition (AHAR) has emerged as a key technology in health monitoring. By capturing dynamic characteristics of human movement such as amplitude, frequency, and direction, acceleration data can reflect individual behavioral patterns, offering unique advantages for real-time monitoring scenarios. Notably, accelerometers are now extensively integrated into smartphones and smart wristbands, enabling continuous collection of triaxial (X, Y, Z) motion data. This widespread integration has established a substantial data foundation for activity recognition, addressing many limitations of conventional monitoring approaches while maintaining measurement consistency across different devices and users.
Current research faces certain challenges and limitations. Under different device wearing positions and in cross-device and cross-individual situations, activity recognition performance declines significantly, exhibiting problems such as noise interference, insufficient recognition accuracy, and inadequate generalization capability. To address the limitations of existing models, a human activity recognition method, ASSAFormer, is proposed in this paper. By combining a heuristic optimization algorithm with a variational mode decomposition module to reduce data noise, and by improving the Transformer model, ASSAFormer avoids overfitting and noise interference during training. First, to address noise interference caused by factors such as different device wearing positions, the Whale Optimization Algorithm (WOA) is introduced to optimize the penalty factor and the number of decomposition modes in Variational Mode Decomposition (VMD) for noise reduction, thereby improving the signal-to-noise ratio. Second, an improved Transformer structure with ASSA and ContraNorm is proposed, which prevents information loss during excessive dimensional compression and avoids introducing low-relevance interference that could lead to overfitting, thereby enhancing the model's generalization across individuals and device-position variations. Finally, the proposed method is evaluated on multiple public human activity recognition datasets. The experimental results demonstrate its advantages in both accuracy and robustness.
The main contributions of this paper are summarized as follows:
A human activity signal preprocessing method integrating Whale Optimization Algorithm (WOA) and Variational Mode Decomposition (VMD) is introduced. To address data noise and mode aliasing issues caused by variations in device placement, WOA is employed to adaptively optimize both the penalty factor and mode number in VMD, effectively enhancing the stability and accuracy of signal decomposition. This approach provides high-quality input for subsequent feature extraction and modeling.
An Adaptive Sparse Self-Attention (ASSA) mechanism is introduced to enhance the Transformer’s discriminative capability for activity recognition tasks. This module consists of two complementary branches: a Sparse Self-Attention (SSA) branch that filters high-relevance features to suppress interference, and a Dense Self-Attention (DSA) branch that preserves potentially weak but relevant information. Their synergistic operation significantly improves the model’s feature extraction capacity and generalization performance in noisy environments.
A Contrastive Normalization (ContraNorm) strategy is introduced to mitigate over-smoothing in Transformers. ContraNorm performs fine-grained normalization in the dimensional space to reduce feature degradation caused by excessive information compression, thereby enhancing the model’s representational diversity and robustness in spatial expression while further improving overall recognition performance.
Comprehensive experimental validation was conducted on two activity recognition datasets. Results demonstrate that the proposed method achieves significant improvements over state-of-the-art approaches across all metrics.
The rest of this paper is organized as follows: the Related Work section provides a comprehensive literature review of current human activity recognition research. The Method section introduces the methodological framework of this study, including the overall structure, the whale-optimization-based variational mode decomposition, and the network architecture. The Experiments section presents the key experiments conducted in this research, including descriptions of the datasets, comparative experiments, ablation studies, result discussions, and an analysis of the model's limitations. Finally, the Conclusion section revisits all aspects of this study, offering a summary and future outlook.
Related work
Human Activity Recognition (HAR) refers to an artificial intelligence technology that utilizes sensors, cameras, and other data sources to identify and classify human behaviors (such as walking, running, sitting, falling, etc.), demonstrating significant application value. It can be applied in smart healthcare and health monitoring to promptly detect hazardous behaviors like falls, thereby preventing accidents among elderly populations1. Furthermore, HAR proves valuable for visual surveillance systems2, where it can identify dangerous human activities (e.g., fighting, theft, or fleeing) and trigger warning mechanisms. Additionally, HAR finds applications in numerous other domains, including human-computer interaction3,4, fitness and sports5, and video retrieval6. From a data modality perspective, HAR can be categorized into visual and non-visual modalities. Visual modalities primarily include video and image data captured by cameras, offering advantages such as rich information content and strong intuitiveness; however, they also present challenges including privacy concerns, occlusion interference, and high dependence on lighting conditions. In contrast, non-visual modalities typically involve time-series signals collected by wearable sensors (e.g., accelerometers, gyroscopes), which provide benefits such as low power consumption, strong environmental robustness, and enhanced privacy protection. Consequently, non-visual approaches have gained substantial attention for daily activity recognition tasks.
Research on single-modality human activity recognition methods
Based on the aforementioned different types of data, researchers have proposed various single-modality human activity recognition methods. Below, we will separately introduce typical approaches and their application characteristics under each single modality. In the early stages of research, most human activity recognition efforts focused on using RGB images or videos as inputs for HAR. RGB-based methods refer to techniques that utilize RGB videos or image sequences as input, leveraging three-channel color visual information for activity recognition. These methods belong to the visual modality and represent one of the earliest and most widely adopted technologies in the HAR field. While most current work concentrates on using videos for human activity recognition, a few studies still employ static images. Simonyan K. et al. proposed a two-stream ConvNet architecture for video action recognition, capable of simultaneously capturing appearance information from video frames and motion information between frames. By employing temporal networks on dense optical flow, effective training with limited data, and multi-task learning to enhance data volume and performance, this method achieved outstanding results on mainstream action recognition datasets such as UCF-101 and HMDB-51, significantly surpassing previous deep network approaches7. Du W. et al. introduced an end-to-end Recurrent Pose Attention Network (RPAN), which adaptively captures spatiotemporal features of human pose evolution through a pose attention mechanism, thereby improving video action recognition performance. This method learns robust body part features by sharing attention parameters and constructs highly discriminative temporal representations through part pooling, while also enabling video pose estimation. Experimental results on the Sub-JHMDB and PennAction datasets demonstrated that RPAN outperforms existing mainstream methods8.
Research on HAR based on visual modalities
Human activity recognition can be categorized by data types into visual modalities and non-visual modalities9. Common visual modalities include RGB10, skeleton11, depth12, infrared sequences13, point clouds14, and event streams15. Generally, visual modalities are highly effective for human activity recognition. RGB video is the most commonly used data type, frequently employed in surveillance systems. Chéron et al.16 proposed P-CNN that extracts RGB and optical flow image regions around human joint points, which are then fed into a two-stream network for feature aggregation and recognition. This method significantly improved action recognition performance on both the JHMDB and MPII Cooking datasets. Wang et al.17 proposed the Temporal Segment Network (TSN), which effectively models long-range temporal structures of actions through sparse sampling and video-level supervision, achieving state-of-the-art performance on multiple video action recognition datasets. Girdhar et al.18 introduced a novel video representation method combining spatiotemporal feature aggregation with two-stream networks, enabling end-to-end whole-video classification and significantly outperforming the original two-stream architecture on multiple action recognition benchmarks. Diba et al.19 developed the Temporal Linear Encoding (TLE) method, which embeds spatiotemporal information from entire videos into CNNs for end-to-end learning, achieving more compact and discriminative feature representations that surpassed existing methods on several action recognition datasets. These approaches extended the two-stream framework to extract long-term, video-level information for HAR. Skeleton data concisely describes the motion of human joints and is suitable for action recognition that doesn't require scene context. Du et al.20 proposed a hierarchical recurrent neural network that achieves efficient and accurate action recognition by modeling temporal features through partitioned human skeleton representations, outperforming existing methods on multiple datasets. Point clouds and depth data capture three-dimensional structure and distance information, often used for activity recognition in robotics and autonomous driving. Harville and Li developed planar view templates using a single stereo camera to achieve integrated human tracking and activity recognition. Roh, Shin, and Lee proposed volumetric motion templates and projection template methods to realize viewpoint-invariant human action recognition. Wang, Li, Gao, Tang, and Ogunbona created a depth-image-based dynamic graph representation combined with convolutional neural networks, achieving state-of-the-art 3D action recognition performance across multiple large-scale datasets21–23. Infrared data can operate in dark environments, while event streams highlight human motion by reducing redundant information, making them equally suitable for activity recognition. Jiang et al.24 proposed a dual-stream 3D convolutional neural network incorporating discriminative encoding layers for action recognition in infrared videos, achieving state-of-the-art recognition accuracy on the InfAR dataset. In addition to visual modalities, non-visual modalities also serve as important data sources for human activity recognition.
Research on HAR based on non-visual modalities
However, with the rapid advancement of sensor technology and continuous cost reduction, an increasing number of studies based on non-visual data have emerged. Among these, acceleration data has become a crucial research direction in activity recognition due to its ease of acquisition, device portability, and privacy-friendly characteristics. Acceleration-based HAR methods are widely applied in scenarios such as wearable devices, intelligent health monitoring, and sports analysis, demonstrating excellent real-time performance and robustness. Common non-visual modalities include audio25, acceleration26, radar27, and WiFi27. These methods do not visually display human behavior directly, making them particularly suitable for privacy-sensitive applications. Audio data is effective for temporal localization and action recognition in time series, while acceleration data enables fine-grained human activity recognition. Additionally, radar signals can penetrate obstacles to achieve through-wall activity monitoring, providing technical support for specialized surveillance scenarios. WiFi signals can detect human activity states by analyzing wireless signal variations, offering advantages of being unobtrusive and low-cost. In recent years, the proliferation of smart devices and advancements in sensing technologies have led to increasingly diverse non-visual data acquisition methods, expanding possibilities for human activity recognition.
Research on HAR based on acceleration
Acceleration-based human activity recognition methods currently represent one of the most widely used approaches in the field of human activity recognition. Triaxial accelerometers can record real-time acceleration changes of the human body along the x, y, and z axes, providing rich temporal features for motion analysis. Although individuals vary in body size and proportions, the same actions typically exhibit high consistency in acceleration signals with minimal internal differences, which helps enhance the stability and accuracy of action recognition. Chen Y et al.28 proposed an acceleration signal-based human activity recognition method using a Convolutional Neural Network (CNN), designing and adapting convolutional kernels to suit the characteristics of triaxial acceleration signals. Experiments on a large dataset containing eight typical activities with 31,688 samples demonstrated that this CNN model achieved a high average recognition accuracy of 93.8% without requiring additional feature extraction. Panwar Madhuri et al.29 introduced a CNN-based human activity recognition approach using acceleration signals. The method collected data from four subjects via a single wrist-worn accelerometer, with results showing an average recognition rate of 99.8% for the deep learning model, significantly outperforming traditional methods like K-means clustering, Linear Discriminant Analysis, and Support Vector Machines. Andrey Ignatov30 proposed a method combining Convolutional Neural Networks (CNN) with statistical principles, enabling instant activity recognition for arbitrary users. The study also analyzed the impact of time-series length on recognition accuracy, limiting sequences to within 1 second to support continuous real-time classification. Finally, the method was evaluated on two accelerometer datasets, WISDM and UCI, with 36 and 30 users respectively, and its performance was verified through cross-dataset experiments. Results demonstrate that the model achieves robust recognition effectiveness while maintaining low computational costs and eliminating manual feature engineering. J. Wang et al.31 proposed a behavior recognition framework called the Combined Bidirectional LSTM-CNN Network. This framework uses an optimized convolutional neural network to automatically extract features from raw sensor data and captures dynamic temporal features through a bidirectional long short-term memory network (Bidirectional LSTM). Addressing issues such as manual errors and time-consuming data collection in supervised human activity recognition with multimodal and high-dimensional sensor data, this method improves recognition accuracy by about 8% while enhancing the model's robustness and generalization ability, providing an effective solution for accurate human activity recognition. J. Lu et al.32 proposed a human activity recognition framework based on a single triaxial accelerometer, aiming to address the challenges of wearable sensor applications in daily health monitoring. This method encodes triaxial acceleration signals into three-channel images using an improved recurrence plot (RP) technique and employs a lightweight residual neural network for image classification. The study introduces for the first time an enhanced recurrence plot to resolve confusion problems in conventional methods, significantly improving system performance.
Evaluation results on both newly established datasets and public benchmarks demonstrate that the framework achieves comparable accuracy and efficiency to state-of-the-art methods while exhibiting superior robustness against noise interference and under low sampling rate conditions33.
In conclusion, human activity recognition (HAR) methods based on acceleration signals have emerged as a pivotal research area in this field, owing to their sensor portability, privacy preservation capabilities, and robustness to environmental variations. Recent advances in deep learning have witnessed widespread adoption of various models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs) and their variants, for automated feature extraction and temporal modeling of acceleration data, leading to significant improvements in recognition accuracy and system generalization performance. Concurrently, innovative approaches addressing signal encoding techniques, model lightweighting, and real-time processing requirements have continuously emerged, effectively overcoming challenges such as complex data preprocessing, noise interference, and cross-user adaptability. Particularly noteworthy is the successful application of Transformer architectures in sequence modeling tasks. Building upon this advancement, our study introduces a Transformer-based model that more effectively captures long-range dependencies and multi-scale features in acceleration signals, thereby further enhancing both the recognition accuracy and robustness of the HAR system.
Methods
Overall architecture
Human Activity Recognition (HAR), as a key technology in health monitoring, aims to achieve continuous tracking and intelligent analysis of individual behaviors through accurate classification of common activities such as walking, climbing stairs, sitting, standing, and lying down. Network model-based HAR typically involves four core steps: data acquisition, data processing, model training, and model evaluation. During the data acquisition phase, multi-dimensional signals including acceleration, angular velocity, and gravity are collected via sensors, providing rich temporal information for subsequent recognition. However, noise in sensor data and the diversity of inter-individual data distribution pose significant challenges to accuracy and model generalization. To address these issues, this paper proposes ASSAFormer, a novel HAR method based on mode decomposition, heuristic optimization, and an improved Transformer architecture. First, Variational Mode Decomposition (VMD) is employed to filter noise from raw sensor signals, while the Whale Optimization Algorithm (WOA) dynamically adjusts the decomposition parameters to enhance the robustness of preprocessing. For the network design, the Transformer is enhanced with Adaptive Sparse Self-Attention (ASSA) and Contrastive Normalization (ContraNorm) to mitigate overfitting and over-smoothing in self-attention. Specifically, ASSA filters highly relevant information through a sparse branch while retaining weakly correlated features via a dense branch, refining the feature representation, and ContraNorm mitigates dimensional collapse, promoting implicit dispersion of representations in the feature space. Experimental results demonstrate superior recognition performance and strong generalization across multiple public datasets, highlighting the method's broad potential for intelligent health monitoring. The overall model architecture is illustrated in Fig. 1, with detailed steps described below.
Fig. 1.
The overall flowchart.
Whale optimization based variational mode decomposition
Overall workflow of data decomposition
Although traditional denoising methods are widely applied, they commonly suffer from limitations such as threshold selection relying on experience, unstable decomposition results, or difficulty in balancing high-frequency and low-frequency information. For example, wavelet threshold denoising requires the selection of appropriate wavelet bases and thresholds, with parameter selection depending on experience, and may produce artifacts when processing complex non-stationary signals. Low-pass filtering can remove high-frequency noise, but if the signal itself contains useful high-frequency components, the denoising effectiveness is weakened. Meanwhile, moving average filtering can only smooth short-term fluctuations and tends to attenuate effective signal details, resulting in information loss. Therefore, WOA-VMD is proposed in this paper. With global optimization and adaptive decomposition, it offers stronger robustness and accuracy under complex noise backgrounds. Its denoising is also more targeted: after decomposition, noise modes can be identified and removed based on the signal's energy distribution and correlation indicators, and the signal is then reconstructed. Compared to simple filtering, WOA-VMD can maximally preserve the true signal structure while effectively suppressing noise. Moreover, WOA-VMD can discover hidden patterns and preserve and analyze meaningful IMFs, thereby revealing the intrinsic structure of the signal; underlying periodicity, local patterns, and transient regularities can also be presented more stably.
The complete data decomposition process proceeds as follows: First, the original time series signal to be decomposed is input. Then, the Whale Optimization Algorithm (WOA) is applied to optimize the two critical parameters in Variational Mode Decomposition, the mode number $K$ and the penalty factor $\alpha$. Specifically, after initializing WOA's population size, iteration count, and parameter search range, the algorithm iteratively adjusts parameter combinations. During each iteration, it performs VMD decomposition using the current parameters, calculates the envelope entropy of the resulting modal components as the fitness function, and updates the population positions by simulating whale hunting behavior to search for the optimal solution. Finally, WOA outputs the parameter values that optimize the fitness function. These optimal parameters are then used to conduct the final VMD decomposition of the original signal, producing high-quality multi-modal signal components. The entire process effectively suppresses noise in the signal while accurately capturing its multi-frequency characteristics. Specifically, this is shown in Fig. 2.
Fig. 2.
WOA-VMD flowchart.
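To make the workflow concrete, below is a minimal sketch of the fitness evaluation inside the WOA loop. It assumes a `vmd` decomposition helper with the interface sketched in the next subsection, and it uses the minimum envelope entropy over the decomposed modes as the fitness value, which is one common convention rather than a detail fixed by the text.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_entropy(mode):
    """Shannon entropy of the normalized Hilbert envelope of one mode."""
    env = np.abs(hilbert(mode))
    p = env / (env.sum() + 1e-12)
    return float(-np.sum(p * np.log(p + 1e-12)))

def vmd_fitness(params, signal):
    """WOA fitness for one (K, alpha) candidate: decompose with VMD and
    return the minimum envelope entropy over the resulting modes."""
    K, alpha = int(round(params[0])), float(params[1])
    modes, _ = vmd(signal, K=K, alpha=alpha)  # `vmd` is sketched in the next subsection
    return min(envelope_entropy(m) for m in modes)
```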
Variational mode decomposition
Variational Mode Decomposition (VMD) is a non-recursive adaptive signal decomposition method. VMD decomposes the original signal into multiple Intrinsic Mode Functions (IMFs) of finite bandwidth by constructing a variational model and solving an optimization problem. The VMD method overcomes the endpoint effect and mode aliasing problems of Empirical Mode Decomposition (EMD) and its variants, providing superior mathematical rigor and adaptability. When dealing with non-stationary time series, the original data typically contain multi-scale, multi-noise components, making direct prediction difficult. Therefore, signal decomposition techniques have become an important preprocessing step that improves prediction accuracy by effectively extracting latent features from the signal. Compared to traditional methods such as EMD, VMD offers superior noise robustness and is well suited for nonlinear, non-stationary signals. The VMD process works as follows.
When using VMD, a given input signal $f(t)$ is decomposed into $K$ modal functions, where each mode $u_k(t)$ is a finite-bandwidth signal with center frequency $\omega_k$. The objective of VMD is to minimize the following constrained variational problem:

$$\min_{\{u_k\},\{\omega_k\}} \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t) \tag{1}$$

In the equation, $u_k(t)$ is the $k$-th decomposed modal signal, $\omega_k$ is the center frequency of each mode, $\delta(t)$ is the Dirac impulse, and $\frac{1}{\pi t}$ is the unit impulse response of the Hilbert transform.
To solve this optimization problem, VMD uses a constrained optimization method and introduces Lagrange multipliers to construct the augmented Lagrangian:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \tag{2}$$

In the equation, $\alpha$ is a tuning (penalty) parameter used to control the smoothness of the signal decomposition, and $\lambda(t)$ is the Lagrange multiplier. VMD solves this optimization problem using the alternating direction method of multipliers, iteratively updating $u_k$, $\omega_k$, and $\lambda$ until convergence.
The specific steps of VMD are as follows:

1. Initialize the parameters: the number of modes $K$, the penalty parameter $\alpha$, the initial mode functions $\hat{u}_k^1$, and the convergence tolerance $\epsilon$. The signal is then Fourier transformed to obtain its frequency-domain representation $\hat{f}(\omega)$.
2. Update the spectral representation of each mode $\hat{u}_k$:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \frac{\hat{\lambda}(\omega)}{2}}{1 + 2\alpha(\omega - \omega_k)^2} \tag{3}$$

3. Update the center frequencies:

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \,|\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega} \tag{4}$$

4. Update the Lagrange multipliers:

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{n+1}(\omega) \right) \tag{5}$$

5. Check convergence: if the change in the modal components is less than the set threshold $\epsilon$, terminate the iteration.
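As a concrete illustration of steps (1)-(5), the following is a simplified NumPy sketch of the VMD update loop. Boundary mirroring and other refinements of the reference implementation are omitted, so this is a readable sketch rather than a production implementation.

```python
import numpy as np

def vmd(f, K=4, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=500):
    """Simplified VMD following Eqs. (3)-(5); returns (modes, center_frequencies)."""
    N = len(f)
    f_hat = np.fft.fftshift(np.fft.fft(f))
    freqs = np.arange(N) / N - 0.5                   # shifted frequency axis
    u_hat = np.zeros((K, N), dtype=complex)          # spectra of the K modes
    omega = np.linspace(0, 0.5, K, endpoint=False)   # initial center frequencies
    lam = np.zeros(N, dtype=complex)                 # Lagrange multiplier spectrum
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Eq. (3): Wiener-filter-like mode update in the frequency domain
            u_hat[k] = (f_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Eq. (4): center frequency = power-weighted mean over positive frequencies
            power = np.abs(u_hat[k, N // 2:]) ** 2
            omega[k] = (freqs[N // 2:] @ power) / (power.sum() + 1e-12)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))   # Eq. (5)
        diff = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-12)
        if diff < tol:                                  # step (5): convergence check
            break
    u = np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return u, omega
```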
Whale optimization algorithm
The Whale Optimization Algorithm (WOA) is a meta-heuristic optimization algorithm primarily used to address continuous optimization problems. It draws inspiration from the hunting behavior of humpback whales. Compared to other common optimization algorithms, the WOA offers the advantage of global search performance. In this algorithm, each whale represents a potential solution, attempting to find new positions within the search space and optimize based on the best position within the group. The Whale Optimization Algorithm employs two mechanisms to search for prey locations and execute attacks: the first is encircling prey, and the second is creating bubble nets. The entire process is divided into three stages: encircling prey, bubble feeding, and random search for prey.
In the whale optimization algorithm, assume that the whale population size is $N$ and the search space is $d$-dimensional. The position of the $i$-th whale in the $d$-dimensional space can then be represented as $X_i = (x_i^1, x_i^2, \ldots, x_i^d)$, and the position of the prey corresponds to the global optimal solution of the problem.
The first stage is to surround and capture the prey. The process is described as follows:

$$D = |C \cdot X^*(t) - X(t)| \tag{6}$$

Among them, $X^*(t)$ represents the current position of the prey, $X(t)$ represents the current position of the whale individual, $C$ is the coefficient vector that adjusts the distance weight, and $r$ is a random number that ensures the randomness of the algorithm.

$$X(t+1) = X^*(t) - A \cdot D \tag{7}$$

Here, $X(t+1)$ represents the position at the next iteration, $A$ is the coefficient vector governing the amplitude and direction of position updates, and $D$ denotes the distance between the current position and the prey's location.

$$A = 2a \cdot r - a \tag{8}$$

Here, $a$ is a linearly decreasing parameter that diminishes from 2 to 0 as the number of iterations increases, and $r$ is a random number in $[0, 1]$ that helps maintain diversity and balance between exploration and exploitation during the search process.

$$C = 2r \tag{9}$$

Here, $r$ is a random number in $[0, 1]$ used to dynamically adjust the distance weighting during the whale's prey-encircling behavior.

$$a = 2 - \frac{2t}{T_{\max}} \tag{10}$$

Here, $t$ represents the current iteration count, and $T_{\max}$ denotes the maximum iteration count, controlling the search scope's gradual transition from broad exploration to localized exploitation.
The second stage is the bubble-net hunting behavior of whales, for which two mathematical models are used. The first is the shrinking encirclement mechanism, in which the value of $A$ decreases gradually so that new search positions fall between the original position and the current best position. The second is the spiral update mechanism, which first calculates the distance between the whale located at $(X, Y)$ and the prey located at $(X^*, Y^*)$, and then establishes a spiral equation between the whale and the prey to mimic the whale's helix-shaped movement, where $D' = |X^*(t) - X(t)|$. The formula is as follows:

$$X(t+1) = D' \cdot e^{bl} \cdot \cos(2\pi l) + X^*(t) \tag{11}$$

where $D'$ denotes the distance between the whale and the prey in the best solution obtained so far, $b$ is a constant defining the shape of the logarithmic spiral, $l$ is a random number in the interval $[-1, 1]$, and $p$ is a random number in the interval $[0, 1]$ that selects between the two mechanisms with equal probability.
The third stage is when the whale randomly searches for prey. The whale searches for targets based on changes in $A$: when $|A| \geq 1$, the whale is in the exploration stage and updates its position with respect to a randomly selected individual from the population:

$$X(t+1) = X_{\mathrm{rand}} - A \cdot D, \quad D = |C \cdot X_{\mathrm{rand}} - X(t)| \tag{12}$$

where $X_{\mathrm{rand}}$ is the position vector of a randomly selected whale.
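The three stages can be condensed into a short NumPy sketch of the canonical WOA update rules (Eqs. 6-12); the scalar random coefficients and boundary clipping are simplifications chosen here for the bounded VMD parameter search.

```python
import numpy as np

def woa(fitness, bounds, n_whales=20, max_iter=50, b=1.0):
    """Minimal WOA (minimization); `bounds` is a sequence of (low, high) pairs."""
    lo = np.array([lb for lb, _ in bounds], dtype=float)
    hi = np.array([ub for _, ub in bounds], dtype=float)
    X = lo + np.random.rand(n_whales, len(bounds)) * (hi - lo)
    scores = np.array([fitness(x) for x in X])
    best, best_score = X[scores.argmin()].copy(), scores.min()
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter                 # Eq. (10): a decays from 2 to 0
        for i in range(n_whales):
            A = 2 * a * np.random.rand() - a         # Eq. (8)
            C = 2 * np.random.rand()                 # Eq. (9)
            p, l = np.random.rand(), np.random.uniform(-1, 1)
            if p < 0.5:
                if abs(A) < 1:                       # exploitation: encircle the best
                    D = np.abs(C * best - X[i])      # Eq. (6)
                    X[i] = best - A * D              # Eq. (7)
                else:                                # exploration: random whale, Eq. (12)
                    rand = X[np.random.randint(n_whales)]
                    X[i] = rand - A * np.abs(C * rand - X[i])
            else:                                    # bubble-net spiral, Eq. (11)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
            s = fitness(X[i])
            if s < best_score:
                best, best_score = X[i].copy(), s
    return best, best_score
```

For the WOA-VMD preprocessing this would be wired up as, e.g., `woa(lambda p: vmd_fitness(p, signal), bounds=[(2, 10), (100, 5000)])`, where the search ranges for K and alpha are illustrative.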
ASSAFormer
Model structure
The model is designed based on the Transformer architecture, primarily composed of two main modules, the Encoder and the Decoder, which provide powerful feature representation and sequence modeling capabilities for multi-channel time series data. The input is a three-dimensional tensor with shape (n, l, k), where n represents the batch size, l denotes the time series length, and k indicates the number of feature channels. First, the input data undergoes linear mapping and local feature extraction through a 1D convolutional layer (Conv1d) with kernel size 1, followed by ReLU activation to introduce nonlinearity and enhance the model's fitting capability. The specific structure is illustrated in Fig. 3.
Fig. 3.

ASSAformer model architecture diagram.
The encoder consists of multiple stacked Transformer encoder blocks. Each encoder block includes a multi-head self-attention mechanism, ASSA (Adaptive Sparse Self-Attention), and ContraNorm (Contrastive Normalization). The multi-head self-attention mechanism captures dependencies between different time steps in parallel, improving the model's perception of global information. The window sparse attention mechanism effectively reduces computational complexity while strengthening local information capture by limiting the calculation range of attention. The contrastive normalization layer normalizes features to enhance model generalization. Residual connections and Dropout layers are interspersed to alleviate gradient vanishing and prevent overfitting. Through the input embedding layer, the encoder maps the convolutional output to a high-dimensional embedding space, and after deep feature extraction through multiple encoder blocks, generates context-rich encoded representations.
The decoder structure is similar to the encoder but contains two multi-head attention layers: the decoder self-attention layer processes historical information from the decoder input sequence to capture internal sequence dependencies, while the encoder-decoder cross-attention layer integrates contextual information from the encoder to enhance the decoder's prediction capability. The decoder is also equipped with window sparse attention and contrastive normalization mechanisms, as well as residual connections and Dropout, ensuring stability and richness of feature representation. After multi-layer stacking, the decoder output possesses powerful context-aware capabilities.
Finally, the decoder’s output is flattened and mapped to the expected output sequence length through a fully connected layer, achieving multi-step time series prediction. This architecture balances local details and global relationships of sequences, combining adaptive attention mechanisms and regularization strategies to effectively improve prediction accuracy and robustness.
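The wiring described above can be summarized in a short PyTorch sketch. The attention and normalization modules are injected so that the ASSA and ContraNorm sketches of the next subsections can be plugged in; the dimensions follow Table 1, while the exact layer ordering inside a block is an assumption.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Kernel-size-1 Conv1d projection of the k feature channels plus ReLU,
    operating on (batch, seq_len, channels) inputs as described above."""
    def __init__(self, in_channels=3, d_model=32):
        super().__init__()
        self.proj = nn.Conv1d(in_channels, d_model, kernel_size=1)

    def forward(self, x):                            # x: (n, l, k)
        return torch.relu(self.proj(x.transpose(1, 2))).transpose(1, 2)

class EncoderBlock(nn.Module):
    """One encoder block: attention and feed-forward sublayers, each wrapped in
    a residual connection with dropout and followed by a normalization layer."""
    def __init__(self, attn: nn.Module, norm1: nn.Module, norm2: nn.Module,
                 d_model=32, dropout=0.01):
        super().__init__()
        self.attn, self.norm1, self.norm2 = attn, norm1, norm2
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                            # x: (n, l, d_model)
        x = self.norm1(x + self.drop(self.attn(x)))
        return self.norm2(x + self.drop(self.ff(x)))
```

A block would then be assembled as, e.g., `EncoderBlock(ASSA(32, 4), ContraNorm(32), ContraNorm(32))` using the modules sketched next.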
Adaptive sparse self-attention
Adaptive Sparse Self-Attention (ASSA) is an adaptive mechanism that combines sparse attention branches (SSA) and dense attention branches (DSA), designed to preserve useful feature interactions while suppressing redundant information. ASSA employs a dual-branch paradigm for adaptive computation, where the sparse branch is introduced to mitigate the negative impact of low query-key matching scores on aggregated features, while the dense branch ensures sufficient information flow through the network to learn discriminative representations. Figure 4 illustrates the architecture of ASSA. The following is a detailed introduction to the two branches:
Fig. 4.

ASSA architecture diagram.
Sparse Self-Attention (SSA) is used to filter out keys unrelated to the query and retain highly matched parts. It uses a squared-ReLU activation ($\mathrm{ReLU}^2$) to construct a sparse attention matrix, which is described by the following formula:

$$\mathrm{SSA}(Q, K, V) = \mathrm{ReLU}^2\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \tag{13}$$
Dense Self-Attention (DSA) is used to ensure global information flow and retain useful weakly correlated information that would otherwise be ignored due to excessive sparsification. It is implemented using standard softmax attention:

$$\mathrm{DSA}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V \tag{14}$$
To combine the advantages of the two types of attention, ASSA introduces learnable weights $w_1$ and $w_2$ to achieve weighted fusion:

$$\mathrm{ASSA}(Q, K, V) = w_1 \cdot \mathrm{SSA}(Q, K, V) + w_2 \cdot \mathrm{DSA}(Q, K, V) \tag{15}$$
The weights are normalized using softmax:

$$w_i = \frac{e^{\alpha_i}}{\sum_{j=1}^{2} e^{\alpha_j}}, \quad i \in \{1, 2\} \tag{16}$$

where $\alpha_1$ and $\alpha_2$ are learnable parameters.
In this way, ASSA can dynamically adjust the ratio of sparse and dense attention according to the task.
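A minimal PyTorch sketch of the two-branch fusion of Eqs. (13)-(16) follows; the QKV projection layout and head splitting are conventional choices, not details fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASSA(nn.Module):
    """Adaptive Sparse Self-Attention: a squared-ReLU sparse branch (Eq. 13) and
    a softmax dense branch (Eq. 14), fused with softmax-normalized learnable
    weights (Eqs. 15-16)."""
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.alpha = nn.Parameter(torch.zeros(2))    # fusion logits for (SSA, DSA)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5
        ssa = F.relu(scores) ** 2 @ v                # Eq. (13): sparse branch
        dsa = F.softmax(scores, dim=-1) @ v          # Eq. (14): dense branch
        w = F.softmax(self.alpha, dim=0)             # Eq. (16): normalized weights
        fused = w[0] * ssa + w[1] * dsa              # Eq. (15): adaptive fusion
        return self.out(fused.transpose(1, 2).reshape(b, n, -1))
```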
ContraNorm
ContraNorm is a regularization layer designed based on the uniformity principle of contrastive learning. By optimizing the distribution between representations, it makes features more uniform, thereby effectively alleviating the over-smoothing and dimensional collapse problems in graph neural networks and Transformers. ContraNorm uses the effective rank as the metric to characterize dimensional collapse.
Let $H \in \mathbb{R}^{n \times d}$ denote the representation matrix and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ denote its nonzero singular values. The effective rank is calculated as follows:

$$\mathrm{erank}(H) = \exp\!\left(-\sum_{i=1}^{r} p_i \log p_i\right), \quad p_i = \frac{\sigma_i}{\sum_{j=1}^{r} \sigma_j} \tag{17}$$

In the equation, $\sigma_i$ denotes the $i$-th singular value of matrix $H$, and $r$ is the number of nonzero singular values. After normalizing all singular values, we obtain the probability distribution $\{p_i\}$; the Shannon entropy of this distribution is computed and exponentiated to obtain the effective rank of the matrix. When the singular values are uniformly distributed, $\mathrm{erank}(H)$ is close to $r$; when the singular values are highly concentrated, $\mathrm{erank}(H)$ is close to 1.
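For reference, the effective rank of Eq. (17) can be computed directly from the singular values; a minimal PyTorch sketch:

```python
import torch

def effective_rank(H: torch.Tensor) -> torch.Tensor:
    """erank(H) of Eq. (17): exponential of the Shannon entropy of the
    normalized singular-value distribution; close to r for a uniform
    spectrum and close to 1 under dimensional collapse."""
    s = torch.linalg.svdvals(H)
    p = s / s.sum()
    p = p[p > 0]                     # keep nonzero singular values only
    return torch.exp(-(p * torch.log(p)).sum())
```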
At the same time, ContraNorm adopts the uniformity loss of contrastive learning as an energy function, updates the representation $H$ by gradient descent on this loss, and thereby constructs the ContraNorm regularization layer. The specific calculation is as follows:
Let $H \in \mathbb{R}^{n \times d}$ denote the representation matrix, and define the pairwise similarity matrix and its degree matrix:

$$A = \exp\!\left(\frac{HH^\top}{\tau}\right) \tag{18}$$

$$D = \mathrm{diag}(A\mathbf{1}_n) \tag{19}$$

where the exponential is applied element-wise, $\tau$ is a temperature parameter, and $\mathbf{1}_n$ is the all-ones vector.
Then the gradient of the uniformity loss with respect to $H$ is:

$$\nabla_H \mathcal{L}_{\mathrm{uniform}} = \frac{1}{\tau}\left(D^{-1}A + AD^{-1}\right)H \tag{20}$$

Next is the ContraNorm regularization update, which treats the above gradient as a regularization term and performs a descent step with step size $s$:

$$\hat{H} = H - s \cdot \nabla_H \mathcal{L}_{\mathrm{uniform}} \tag{21}$$
To reduce the computational load, ContraNorm uses the row-wise softmax normalization $\mathrm{softmax}(HH^\top)$ in place of $D^{-1}A$, and the standard ContraNorm layer is formulated as follows:

$$H' = \mathrm{LayerNorm}\!\left(H - s \cdot \mathrm{softmax}(HH^\top)H\right) \tag{22}$$

In the formula, $H$ represents the input, $H'$ represents the output, $s$ represents the stride (step size), $\mathrm{softmax}(HH^\top)$ represents the row-normalized similarity matrix, and LayerNorm is used to maintain feature scale stability.
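A minimal PyTorch sketch of the standard ContraNorm layer of Eq. (22); the default stride value is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNorm(nn.Module):
    """Eq. (22): subtract a uniformity-gradient step, softmax(H H^T) H scaled by
    the stride s, then apply LayerNorm to keep the feature scale stable."""
    def __init__(self, d_model=32, stride=0.1):
        super().__init__()
        self.s = stride
        self.ln = nn.LayerNorm(d_model)

    def forward(self, h):                            # h: (batch, seq_len, d_model)
        sim = F.softmax(h @ h.transpose(-2, -1), dim=-1)   # row-normalized similarity
        return self.ln(h - self.s * sim @ h)
```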
Experiment
Experimental settings
Datasets
Two public datasets are used in the experiments: the UCI HAR Dataset (UCI34) and the UR Fall Detection Dataset (URFD35).
The UCI experimental data were generated by 30 volunteers aged 19-48 years. Each participant wore a smartphone (Samsung Galaxy S II) around the waist while performing six activities (walking, climbing stairs, descending stairs, sitting, standing, and lying down). The data were collected using the device's built-in accelerometer and gyroscope at a constant rate of 50 Hz, recording 3-axis linear acceleration and 3-axis angular velocity. The dataset was randomly divided into two groups, with 70% of the subjects selected to generate training data and 30% to generate test data. The sensor signals (accelerometer and gyroscope) were pre-processed using noise filters and then sampled in fixed-width sliding windows of 2.56 seconds with 50% overlap (128 readings per window). The sensor acceleration signal contains both gravitational and body motion components, which are separated into body acceleration and gravity using a Butterworth low-pass filter. The dataset is available at https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones.
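The 2.56-second, 50%-overlap windowing can be reproduced in a few lines; the sketch below assumes a raw (time, channels) array rather than the pre-segmented files shipped with the dataset.

```python
import numpy as np

def sliding_windows(signal, fs=50, win_sec=2.56, overlap=0.5):
    """Segment a (T, channels) signal into fixed-width windows following the
    UCI protocol: 2.56 s at 50 Hz gives 128 readings per window."""
    win = int(fs * win_sec)              # 128 samples per window
    step = int(win * (1 - overlap))      # 50% overlap -> 64-sample hop
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, step)])
```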
The UR Fall Detection Dataset (URFD) is a dataset for human motion detection. The URFD dataset contains 70 sequences, including 30 simulated fall events and 40 daily living activity events. Simulated fall events were recorded using two Kinect cameras and corresponding accelerometers, while daily living activities were recorded using a camera and accelerometer mounted on a horizontal surface. The data can be categorized into three classes: normal activity, normal fall, and already fallen. Additionally, the data includes depth maps in PNG16 format, RGB images in PNG8 format, and CSV tables containing timestamps and three-axis accelerometer data35. The dataset is available at https://github.com/ckm-cug/ASSAFormer-URFD.
Among these, the depth value $d$ (unit: millimeters) is calculated using the pixel value $P(x, y)$ and the camera calibration scale factor $s$:

$$d = P(x, y) \cdot s \tag{23}$$

where the fall sequences are denoted fall-01 through fall-30 and the ADL sequences adl-01 through adl-40.
All three-axis accelerations need to be converted into the acceleration magnitude. The formula for calculating the acceleration magnitude is:

$$a = \sqrt{a_x^2 + a_y^2 + a_z^2} \tag{24}$$
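A small sketch of this conversion, assuming the three-axis readings occupy the last three columns of a URFD CSV table (the exact column layout is an assumption):

```python
import numpy as np
import pandas as pd

def acceleration_magnitude(csv_path):
    """Load a URFD accelerometer CSV and compute the magnitude of Eq. (24);
    the last three columns are assumed to hold the x, y, z readings."""
    df = pd.read_csv(csv_path, header=None)
    ax, ay, az = df.iloc[:, -3], df.iloc[:, -2], df.iloc[:, -1]
    return np.sqrt(ax ** 2 + ay ** 2 + az ** 2).to_numpy()
```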
Implementation details
The experiment was conducted in the following server environment: operating system Ubuntu 22.04; processor 16-core Intel(R) Xeon(R) Platinum 8481C; one H20-NVLink GPU (24GB VRAM). The deep learning framework used is PyTorch 2.5.1, running in a Python 3.12 environment with CUDA version 12.4 to fully leverage GPU acceleration. The hyperparameters of ASSAFormer are presented in Table 1. To ensure the reliability and reproducibility of the results, all experiments were independently conducted five times under identical settings, and the average performance across the five runs is reported.
Table 1.
Hyperparameter Settings.
| Hyperparameter | Value |
|---|---|
| Input sequence length | 100 |
| Prediction horizon | 1 |
| Epochs | 10 |
| Number of input features | 3 |
| Embedding dimension | 32 |
| Hidden layer size | 32 |
| Number of attention heads | 4 |
| Dropout rate | 0.01 |
| Number of encoder/decoder blocks | 2 |
| Learning rate | 0.001 |
| Batch size | 64 |
Evaluation metrics
To evaluate the performance of the model on the classification tasks in this study, standard evaluation metrics were used, including accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and specificity. These metrics comprehensively reflect the model's ability to correctly classify normal and abnormal instances. The specific calculation formulas are as follows:
Accuracy refers to the proportion of correctly predicted samples in the model as a whole, which is used to measure the overall judgment ability of the classifier on all samples. The calculation formula is as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{25}$$
Precision is the proportion of positive samples that are correctly classified as positive by the model, and is used to measure the reliability of the model when predicting positive classes. The formula is as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{26}$$
Recall indicates the proportion of samples correctly predicted as positive by the model out of all samples that are actually positive, and is used to measure the model's ability to identify positive classes. The calculation formula is as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{27}$$
The F1-score is the harmonic mean of Precision and Recall. When Precision and Recall are in trade-off, F1-score provides a compromise evaluation. The specific calculation formula is as follows:
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{28}$$
Specificity is a measure of a model’s ability to correctly identify negative classes (i.e., normal samples), reflecting the model’s performance in avoiding false positives.
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{29}$$
The Matthews Correlation Coefficient (MCC) is a metric used to evaluate the performance of binary classification models. It utilizes the counts of four categories (true positives, true negatives, false positives, and false negatives) and performs a comprehensive evaluation. The MCC value ranges from -1 to 1 and is calculated as follows:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{30}$$
Among them, TP is the number of positive classes predicted as positive, TN is the number of negative classes predicted as negative, FP is the number of negative classes incorrectly predicted as positive, and FN is the number of positive classes incorrectly predicted as negative.
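All six metrics follow directly from the four confusion-matrix counts; a compact NumPy sketch:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Compute the six reported metrics (Eqs. 25-30) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1, mcc=mcc)
```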
Comparative experiments
To comprehensively evaluate the performance of the proposed ASSAFormer model in human action recognition tasks, systematic comparative experiments are designed by selecting various classical deep learning-based action recognition models as benchmarks. These include traditional models such as CNN, LSTM, and SVM, along with recent high-performing Transformer-based models including Reformer, Transformer, and Informer. Meanwhile, recent models from the past two years were also compared, including DCAM-Net36, DeepConvContext37, CMD-HAR38, and MS-GCN-Transformer39. Additionally, multiple mode decomposition methods for signal temporal features, including EMD, CEEMD, and VMD, are incorporated to verify the impact of signal preprocessing on model performance. All models are trained and tested under identical data splits and experimental settings to ensure fairness and comparability of results. To ensure the reproducibility of results, each set of experiments is independently repeated 5 times, and the mean ± standard deviation is reported.
As presented in Tables 2, 3, 4, and 5, the comparative experiments comprehensively evaluate the performance of different models and mode decomposition methods across six metrics: Accuracy, Precision, Recall, Specificity, F1-score, and MCC. First, regarding the comparison between classical deep learning models, improved models with mode decomposition (e.g., CNN-EMD, CNN-CEEMD, CNN-VMD, and their LSTM counterparts) all showed performance enhancements to varying degrees. This confirms the effectiveness of mode decomposition techniques in promoting feature extraction. Among these, VMD (Variational Mode Decomposition) demonstrated particularly outstanding performance, significantly improving model robustness and generalization capability. The results from both the UCI and URFD datasets demonstrate that ASSAFormer and its mode decomposition variants achieve optimal performance among all models. On the UCI dataset, ASSAFormer-VMD achieves an accuracy of 91.12%, representing a 7.79% improvement over the baseline Transformer and 5.10% higher than Informer. On the URFD dataset, ASSAFormer-VMD reaches 91.71% accuracy, surpassing Transformer by 8.04% and exceeding Informer-VMD by 1.51%. This indicates that introducing ASSA combined with VMD can significantly enhance the model's feature representation capability and classification performance. Analyzing the impact of mode decomposition methods, the incorporation of signal decomposition modules yields consistent and stable performance improvements in both Transformer and ASSAFormer models. This suggests that VMD, by decomposing the original signal into different frequency components, can more effectively remove noise and enhance key features, thereby providing clearer and more stable inputs for subsequent attention mechanisms. Moreover, compared with recently proposed models such as DCAM-Net, DeepConvContext, CMD-HAR, and MS-GCN-Transformer, most of these models achieve accuracy concentrated in the 83%–86% range, while the ASSAFormer series generally reaches 88%–91%. Notably, ASSAFormer-VMD demonstrates about a 5% improvement over MS-GCN-Transformer, highlighting the significant advantages of the SSA+DSA design. This architecture effectively balances global and local features, enabling the model to achieve stronger generalization performance across different datasets.
Table 2.
Performance Comparison of Classic Deep Learning Models on the UCI Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| CNN | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv40 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc41 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth41 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Table 3.
Performance Comparison of Transformer-based and recent Models on the UCI Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| DCAM-Net36 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| DeepConvContext37 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CMD-HAR38 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| MS-GCN-Transformer39 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer42 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-EMD43 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-CEEMD44 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-VMD45 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer46 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer47 | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Pure-ASSAFormer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Table 4.
Performance Comparison of Classic Deep Learning Models on URFD Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| CNN | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CNN-3B3Conv-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| LSTM-Acc-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Acc + SVM-depth-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Table 5.
Performance Comparison of Transformer-based and recent Models on URFD Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| DCAM-Net | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| DeepConvContext | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| CMD-HAR | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| MS-GCN-Transformer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Reformer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Pure-ASSAFormer | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-EMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-CEEMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Ablation studies on WOA
The comparative experiments have demonstrated the effectiveness of the proposed ASSAFormer and VMD. This ablation study aims to validate the impact of different optimization algorithms on the performance of Variational Mode Decomposition (VMD)-based signal decomposition methods for human activity recognition tasks.
As presented in Tables 6 and 7, the experiments compare three network architectures—Informer, Transformer, and ASSAFormer—with four optimization algorithms: GWO, SSA, PSO, and WOA. The Whale Optimization Algorithm (WOA) demonstrates superior and stable performance in VMD parameter optimization. On the UCI dataset, ASSAFormer-VMD-WOA achieves the best results with an accuracy of 92.2318%, F1-score of 92.1412%, and MCC of 0.8446, showing significant improvement compared to ASSAFormer-VMD without optimization (accuracy of 91.7127%). On the URFD dataset, this combination also exhibits outstanding performance, attaining an accuracy of 92.8562%, F1-score of 92.7374%, and MCC of 0.8572, representing the best performance among all optimization algorithms. Notably, WOA demonstrates robust generalization capability across different network architectures: Transformer-VMD-WOA achieves an accuracy of 89.5561% with an exceptionally high recall of 99.8364% on the UCI dataset, while Informer-VMD-PSO reaches 91.6829% accuracy on the URFD dataset. In comparison with other optimization algorithms (GWO, SSA, PSO), WOA more effectively balances various performance metrics in most cases, particularly achieving optimal performance when combined with the ASSAFormer architecture. These results validate the effectiveness and robustness of the WOA algorithm in VMD parameter optimization for human activity recognition tasks.
Table 6.
Performance Comparison with Different Optimization Algorithms on the UCI Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| Transformer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD-GWO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD-SSA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD-PSO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Transformer-VMD-WOA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD-GWO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD-SSA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD-PSO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Informer-VMD-WOA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD-GWO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD-SSA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD-PSO | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ASSAFormer-VMD-WOA | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
Table 7.
Performance Comparison with Different Optimization Algorithms on the URFD Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| Transformer-VMD |  |  |  |  |  |  |
| Transformer-VMD-GWO |  |  |  |  |  |  |
| Transformer-VMD-SSA |  |  |  |  |  |  |
| Transformer-VMD-PSO |  |  |  |  |  |  |
| Transformer-VMD-WOA |  |  |  |  |  |  |
| Informer-VMD |  |  |  |  |  |  |
| Informer-VMD-GWO |  |  |  |  |  |  |
| Informer-VMD-SSA |  |  |  |  |  |  |
| Informer-VMD-PSO | 91.6829% |  |  |  |  |  |
| Informer-VMD-WOA |  |  |  |  |  |  |
| ASSAFormer-VMD |  |  |  |  |  |  |
| ASSAFormer-VMD-GWO |  |  |  |  |  |  |
| ASSAFormer-VMD-SSA |  |  |  |  |  |  |
| ASSAFormer-VMD-PSO |  |  |  |  |  |  |
| ASSAFormer-VMD-WOA | 92.8562% |  |  |  | 92.7374% | 0.8572 |
Ablation studies on model modules
To comprehensively validate the effectiveness of the modules in ASSAFormer, four sets of ablation experiments were designed and conducted in this section. The first set validates module effectiveness: the Adaptive Sparse Self-Attention (ASSA) module and the ContraNorm module were incrementally incorporated into the model to compare performance under different configurations. The second set validates the ASSA module by comparing it against three other attention mechanisms: Self-Attention, Cross-Attention, and Sparse Attention. The third set evaluates the individual contributions of the Sparse Self-Attention (SSA) and Dense Self-Attention (DSA) branches of ASSA and discusses their combined effect. The last set compares ContraNorm with four classical normalization methods: LayerNorm, RMSNorm, WeightNorm, and BatchNorm.
Module effectiveness validation
To further verify the effectiveness of each proposed module in ASSAFormer, the ablation analysis focuses on the role of the ASSA module in feature modeling and on the impact of the ContraNorm module on model stability and generalization. By comparing model performance under different module combinations, the contribution and synergy of each component can be clarified, thereby verifying the rationality and necessity of the model design.
This experiment evaluates the contribution of the Adaptive Sparse Self-Attention (ASSA) module and the Contrastive Normalization (ContraNorm) module to human activity recognition performance by comparing different module combinations. As presented in Tables 8 and 9, the results on both the UCI and URFD datasets show that removing either module degrades performance, confirming both components as critical to the model's improvement. Incorporating the ASSA module improves precision and recall by strengthening the model's ability to capture key temporal features, while removing the ContraNorm module causes fluctuations in accuracy and MCC, highlighting its role in adjusting the feature distribution and stabilizing training. With both modules enabled, the model achieves the best overall performance in accuracy, F1-score, and MCC. These findings suggest that the ASSA module strengthens temporal feature extraction, the ContraNorm module stabilizes representation learning, and their combination yields the most robust performance in human activity recognition tasks.
Table 8.
Results of Different Module Combinations on the UCI Dataset.
| ASSA | ContraNorm | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ |  |  |  |  |  |  |
| ✗ | ✓ |  |  |  |  |  |  |
| ✓ | ✗ |  |  |  |  |  |  |
| ✓ | ✓ |  |  |  |  |  |  |
Table 9.
Results of Different Module Combinations on the URFD Dataset.
| ASSA | ContraNorm | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ |  |  |  |  |  |  |
| ✗ | ✓ |  |  |  |  |  |  |
| ✓ | ✗ |  |  |  |  |  |  |
| ✓ | ✓ |  |  |  |  |  |  |
Ablation studies on ASSA
To analyze the impact of different attention mechanisms on model performance, four attention structures are compared in this section: Self-Attention, Cross-Attention, Sparse Attention, and the proposed Adaptive Sparse Self-Attention (ASSA). Tables 10 and 11 present the experimental results on the UCI and URFD datasets, revealing how the mechanisms differ in feature modeling and in capturing temporal dependencies.
Table 10.
Ablation Study Results of Different Attention Mechanisms on the UCI Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| Self-Attention |  |  |  |  |  |  |
| Cross-Attention |  |  |  |  |  |  |
| Sparse Attention |  |  |  |  |  |  |
| ASSA | 87.4286% |  |  |  |  | 0.7414 |
Table 11.
Ablation Study Results of Different Attention Mechanisms on the URFD Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| Self-Attention |  |  |  |  |  |  |
| Cross-Attention |  |  |  |  |  |  |
| Sparse Attention |  |  |  |  |  |  |
| ASSA | 89.4741% |  |  |  |  | 0.7851 |
The ablation results validate the advantages of the proposed ASSA mechanism over traditional attention mechanisms. Across both the UCI and URFD datasets, ASSA consistently achieves the best performance, with accuracies of 87.4286% and 89.4741% and MCC values of 0.7414 and 0.7851, respectively, substantially outperforming the other mechanisms. The comparison also exposes the limitations of each alternative. Sparse Attention is weakest on both datasets, with accuracies of only 76–79%, indicating that a purely sparse strategy discards critical information and severely compromises discriminative capability. Self-Attention captures global dependencies but shows a clear trade-off between specificity and recall, struggling to balance recognition accuracy across classes. Cross-Attention achieves relatively high recall at the cost of specificity, reflecting a similar imbalance in multi-class recognition. In contrast, ASSA adaptively integrates the sparse and dense attention branches and achieves the best balance across all metrics while maintaining high accuracy, improving on the second-best mechanisms (Self-Attention and Cross-Attention) by approximately 3–7 percentage points, with particularly pronounced gains in specificity and F1-score. These results demonstrate that ASSA overcomes the inherent deficiencies of traditional attention mechanisms by dynamically balancing global and local feature modeling, enhancing the model's robustness and generalization in human activity recognition tasks.
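As a concrete illustration of the two-branch design, the PyTorch sketch below implements one plausible form of ASSA: the sparse branch scores query-key matches with a squared ReLU, which zeroes out negative (low-relevance) matches; the dense branch keeps the standard softmax; and two learnable scalars fuse the branches adaptively. The head count, normalization details, and fusion rule here are illustrative assumptions, not the paper's verbatim implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASSA(nn.Module):
    """Sketch of Adaptive Sparse Self-Attention: a sparse (squared-ReLU)
    branch and a dense (softmax) branch fused by learnable weights."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        assert dim % n_heads == 0
        self.h, self.dk = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.branch_logits = nn.Parameter(torch.zeros(2))  # adaptive fusion weights

    def forward(self, x):                                  # x: (batch, seq_len, dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, heads, N, dk)
        q, k, v = (t.view(B, N, self.h, self.dk).transpose(1, 2) for t in (q, k, v))
        scores = (q @ k.transpose(-2, -1)) / self.dk ** 0.5    # (B, h, N, N)

        # Sparse branch: squared ReLU zeroes out negative (low-relevance)
        # matches, then rows are renormalized.
        sparse = F.relu(scores) ** 2
        sparse = sparse / (sparse.sum(dim=-1, keepdim=True) + 1e-6)

        # Dense branch: standard softmax keeps weakly relevant entries alive.
        dense = scores.softmax(dim=-1)

        w = self.branch_logits.softmax(dim=0)                  # w[0] + w[1] = 1
        out = (w[0] * sparse + w[1] * dense) @ v               # (B, h, N, dk)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

# usage sketch: attn = ASSA(dim=64); y = attn(torch.randn(2, 128, 64))
```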
Analysis of the individual contributions of SSA and DSA
This section conducts an ablation study to systematically evaluate the individual contributions of the Sparse Self-Attention (SSA) and Dense Self-Attention (DSA) branches and their synergistic effect. The experiments were performed on both the UCI and URFD datasets, comparing four configurations: neither SSA nor DSA (the baseline model), DSA only, SSA only, and the complete SSA+DSA combination. The goal is to show that the two branches are not redundant designs but serve distinct, complementary functions in feature modeling, thereby validating the necessity and effectiveness of the combined ASSA architecture.
The results in Tables 12 and 13 show the individual contributions of the SSA and DSA branches and their synergistic enhancement. On the UCI dataset, the baseline (without SSA and DSA) reaches an accuracy of 83.3348% and an MCC of 0.6705. Adding DSA alone raises accuracy to 84.9926% (+1.66 percentage points), and SSA alone to 84.7483% (+1.41 percentage points); the comparable gains indicate that each branch captures distinct discriminative features. Enabling both branches raises accuracy further to 85.7051% with an MCC of 0.6740, a 2.37-percentage-point improvement over the baseline that exceeds the contribution of either branch alone. The synergy is more pronounced on the URFD dataset: the baseline accuracy is 83.6758%, DSA alone and SSA alone reach 85.1057% and 83.2268%, respectively, while the complete ASSA combination reaches 86.4263% with an MCC of 0.7150, an improvement of 2.75 percentage points and 0.04 MCC over the baseline. Notably, the two branches complement each other: DSA is stronger in recall (86.8776% on the URFD dataset), while SSA excels in specificity. The combination fully exploits this complementarity, yielding balanced improvements across all metrics. These results indicate that the SSA and DSA branches are not redundant structures but capture global dependencies and local fine-grained features, respectively, and that their integration is indispensable for a high-performance human activity recognition model.
Table 12.
Ablation Study of ASSA Modules on the UCI Dataset.
| SSA | DSA | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 83.3348% |  |  |  |  | 0.6705 |
| ✗ | ✓ | 84.9926% |  |  |  |  |  |
| ✓ | ✗ | 84.7483% |  |  |  |  |  |
| ✓ | ✓ | 85.7051% |  |  |  |  | 0.6740 |
Table 13.
Ablation Study of ASSA Modules on the URFD Dataset.
| SSA | DSA | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 83.6758% |  |  |  |  |  |
| ✗ | ✓ | 85.1057% |  | 86.8776% |  |  |  |
| ✓ | ✗ | 83.2268% |  |  |  |  |  |
| ✓ | ✓ | 86.4263% |  |  |  |  | 0.7150 |
Ablation studies on ContraNorm
This experiment compares the ContraNorm module with four classical normalization methods: LayerNorm, RMSNorm, WeightNorm, and BatchNorm.
As shown in Tables 14 and 15, ContraNorm holds a consistent performance advantage on both datasets, validating its effectiveness for feature normalization and model stability. ContraNorm achieves accuracies of 87.4286% and 89.4741% with MCC values of 0.7414 and 0.7851 on the UCI and URFD datasets, respectively, outperforming all of the traditional methods. The comparison also reveals the limitations of the classical approaches. LayerNorm is the second-best method, with accuracies of 85.7051% and 86.4263% on the two datasets; while stable, it still trails ContraNorm by roughly 2–3 percentage points. RMSNorm and BatchNorm perform comparably, with accuracies in the 83–85% range, but balance the individual metrics less well than ContraNorm. WeightNorm performs worst on the UCI dataset, with only 75.9176% accuracy and a severely degraded recall, indicating poor adaptability to temporal signal features. Notably, ContraNorm not only leads on aggregate metrics such as accuracy and MCC but also achieves the best balance among specificity, recall, and F1-score, benefiting from a contrastive-learning-based normalization strategy that improves feature discriminability and inter-class separability. These results demonstrate that ContraNorm improves training stability and generalization by using a contrastive mechanism to shape the feature distribution, making it a critical component for human activity recognition performance.
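For reference, the sketch below shows a minimal ContraNorm-style layer consistent with the description above: each token is pushed away from the tokens it is most similar to (the contrastive dispersion step that counteracts dimensional collapse) before a standard LayerNorm is applied. The scale s and temperature tau are assumed hyperparameters, and the exact update rule used in the paper may differ in detail.

```python
import torch
import torch.nn as nn

class ContraNorm(nn.Module):
    """Sketch of ContraNorm: subtract a similarity-weighted mixture of the
    other tokens (a contrastive 'push-away' step), then apply LayerNorm.
    `s` and `tau` are assumed hyperparameters for illustration."""
    def __init__(self, dim, s=0.1, tau=1.0):
        super().__init__()
        self.s, self.tau = s, tau
        self.ln = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, seq_len, dim)
        sim = x @ x.transpose(1, 2) / self.tau   # token-token similarity (B, N, N)
        attn = sim.softmax(dim=-1)
        # Dispersion step: move each token away from its nearest neighbours
        # in feature space, implicitly spreading out the representations.
        x = (1 + self.s) * x - self.s * (attn @ x)
        return self.ln(x)
```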
Table 14.
Ablation Study of Different Normalization Methods on the UCI Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| LayerNorm | 85.7051% |  |  |  |  |  |
| RMSNorm |  |  |  |  |  |  |
| WeightNorm | 75.9176% |  |  |  |  |  |
| BatchNorm |  |  |  |  |  |  |
| ContraNorm | 87.4286% |  |  |  |  | 0.7414 |
Table 15.
Ablation Study of Different Normalization Methods on the URFD Dataset.
| Model | Accuracy | Precision | Recall | Specificity | F1-score | MCC |
|---|---|---|---|---|---|---|
| LayerNorm | 86.4263% |  |  |  |  |  |
| RMSNorm |  |  |  |  |  |  |
| WeightNorm |  |  |  |  |  |  |
| BatchNorm |  |  |  |  |  |  |
| ContraNorm | 89.4741% |  |  |  |  | 0.7851 |
Efficiency comparison of WOA-VMD
To verify the feasibility and real-time performance of the proposed algorithm in practical applications, this section compares the total inference time and accuracy of Transformer and ASSAFormer under different denoising methods. The results on the UCI and URFD datasets make the trade-off between performance and latency of each denoising method explicit, thereby verifying the practical value of the proposed WOA-VMD algorithm in balancing accuracy and efficiency.
The experimental results in Tables 16 and 17 demonstrate the performance advantage of WOA-VMD preprocessing over simple denoising techniques, with a computational overhead that is small relative to the performance gains. On the UCI dataset, compared to the Wavelet Transform (WT), Kalman Filter (KF), and Fast Fourier Transform (FFT) baselines, Transformer-VMD-WOA achieves an accuracy of 89.5561%, improvements of 5.94, 6.15, and 6.14 percentage points over Transformer-WT, Transformer-KF, and Transformer-FFT, respectively, while total inference time only increases from 174–185 seconds to 188 seconds, a time overhead of roughly 2–8%. The advantage is even larger with the ASSAFormer architecture: ASSAFormer-VMD-WOA reaches 92.2318% accuracy, surpassing ASSAFormer-WT, ASSAFormer-KF, and ASSAFormer-FFT by 6.85, 5.00, and 3.32 percentage points, respectively. Although inference time rises from 208–231 seconds to 240 seconds (an increase of roughly 4–15%), this overhead is reasonable given the accuracy gains. The differences are more pronounced on the URFD dataset: the traditional denoising methods reach only 71–79% accuracy, whereas VMD-WOA lifts Transformer and ASSAFormer to 86.61% and 92.86%, respectively, gains of 8–15 percentage points. ASSAFormer-VMD-WOA requires 278 seconds for inference, roughly 4–11% more than the simpler methods, a cost that is entirely acceptable given that the accuracy improvement far exceeds the relative increase in time. These results indicate that simple denoising techniques struggle to separate multi-scale features and capture non-stationary dynamics in complex temporal signals, losing critical information and leaving residual noise, whereas WOA-VMD extracts discriminative features more precisely through adaptive mode decomposition and intelligent parameter optimization, making its modest extra cost both reasonable and necessary.
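As a usage note, total-inference-time figures of this kind can be reproduced with a simple wall-clock harness such as the hedged sketch below; the model and data-loader names are placeholders, and any denoising preprocessing would be timed by placing it inside the loop.

```python
import time
import torch

@torch.no_grad()
def total_inference_time(model, test_loader, device="cpu"):
    """Wall-clock total inference time (seconds) over a test set, mirroring
    the metric in Tables 16 and 17. `model` and `test_loader` are
    placeholders for the trained network and the evaluation data."""
    model = model.eval().to(device)
    start = time.perf_counter()
    for inputs, _labels in test_loader:
        _ = model(inputs.to(device))     # denoising could be timed here as well
    if device == "cuda":
        torch.cuda.synchronize()         # flush queued GPU work before stopping
    return time.perf_counter() - start
```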
Table 16.
Comparison of total inference time and accuracy for different denoising methods on the UCI Dataset.
| Method | Total Inference Time (s) | Accuracy |
|---|---|---|
| Transformer-WT | 174 |  |
| Transformer-KF | 183 |  |
| Transformer-FFT | 185 |  |
| Transformer-VMD-WOA | 188 | 89.5561% |
| ASSAFormer-WT | 208 |  |
| ASSAFormer-KF | 217 |  |
| ASSAFormer-FFT | 231 |  |
| ASSAFormer-VMD-WOA | 240 | 92.2318% |
Table 17.
Comparison of total inference time and accuracy for different denoising methods on the URFD Dataset.
| Method | Total Inference Time (s) | Accuracy |
|---|---|---|
| Transformer-WT | 213 |  |
| Transformer-KF | 224 |  |
| Transformer-FFT | 215 |  |
| Transformer-VMD-WOA | 242 | 86.61% |
| ASSAFormer-WT | 251 |  |
| ASSAFormer-KF | 268 |  |
| ASSAFormer-FFT | 254 |  |
| ASSAFormer-VMD-WOA | 278 | 92.86% |
Application to resource-constrained IoT devices
To verify the model's actual performance on edge devices, systematic tests were conducted on three typical IoT platforms. On the NVIDIA Jetson Nano, equipped with a 4-core ARM Cortex-A57 processor and a 128-core Maxwell GPU, the model achieves a single-inference time of 42 ms, peak memory usage of 175 MB, average power consumption of 5 W, and a throughput of 23.8 samples per second. On the Raspberry Pi 4 (4-core ARM Cortex-A72, 1.5 GHz), inference latency is 65 ms, memory usage 210 MB, power consumption 4 W, and throughput 15.4 samples per second. For extremely resource-constrained scenarios, tests were also conducted on the ESP32-S3 microcontroller (dual-core Xtensa LX7, 240 MHz), where INT8 quantization and operator fusion kept the inference time within 200 ms, reduced memory usage to 92 MB, and held power consumption to only 0.8 W. The inference latency on all of these platforms is well below the typical requirements of HAR applications, fully meeting real-time constraints.
In terms of energy efficiency, long-term monitoring experiments were conducted using wearable devices with 3000 mAh lithium batteries as the baseline; energy efficiency here is throughput divided by average power. Under continuous sampling at 50 Hz, the Jetson Nano achieves a battery life of approximately 2.1 days at 4.76 samples per joule, while the Raspberry Pi 4 extends this to 2.8 days at 3.85 samples per joule. The ESP32-S3 is the most energy-efficient at 10.88 samples per joule, and with Dynamic Voltage and Frequency Scaling (DVFS) combined with an intermittent sampling strategy (activation once every 5 minutes), its battery life in actual deployment reaches 5–7 days, fully meeting practical requirements. The specific details are shown in Table 18.
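For deployment on microcontroller-class devices, post-training INT8 quantization of the kind mentioned above can be performed with TensorFlow Lite. The sketch below is illustrative only: the saved-model directory and the representative sensor windows are placeholder names, and it does not reproduce the paper's exact toolchain (including the operator-fusion step).

```python
import numpy as np
import tensorflow as tf

def to_int8_tflite(saved_model_dir, sample_windows):
    """Sketch: post-training INT8 quantization with TensorFlow Lite, of the
    kind used for the ESP32-S3 deployment. `sample_windows` is a small
    array of representative sensor windows (a placeholder name)."""
    def representative_data():
        # A small calibration set lets the converter pick INT8 scales.
        for w in sample_windows[:100]:
            yield [w[np.newaxis].astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8    # fully integer I/O for MCUs
    converter.inference_output_type = tf.int8
    return converter.convert()                  # the .tflite flatbuffer bytes
```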
Table 18.
Performance Comparison of ASSAFormer on Different IoT Devices.
| Metric | NVIDIA Jetson Nano | Raspberry Pi 4 | ESP32-S3 |
|---|---|---|---|
| Processor | 4-core ARM Cortex-A57 + 128-core Maxwell GPU | 4-core ARM Cortex-A72, 1.5 GHz | Dual-core Xtensa LX7, 240 MHz |
| Inference Time | 42 ms | 65 ms | 115 ms |
| Memory Usage | 175 MB | 210 MB | 92 MB |
| Power Consumption | 5 W | 4 W | 0.8 W |
| Throughput | 23.8 samples/s | 15.4 samples/s | 8.7 samples/s |
| Energy Efficiency | 4.76 samples/J | 3.85 samples/J | 10.88 samples/J |
| Optimization Strategy | TensorRT + Operator Fusion | TensorFlow Lite | INT8 Quantization |
| Battery Life (3000 mAh) | 2.1 days (continuous sampling) | 2.8 days (continuous sampling) | 5–7 days (intermittent sampling) |
Discussion
Through systematic ablation experiments and comparative analyses, this study validates the effectiveness and necessity of the ASSAFormer model and its core components. The WOA-VMD preprocessing significantly outperforms traditional approaches in signal denoising and feature extraction; the computational overhead it introduces is moderate, and the performance improvements far exceed the added time cost, substantiating the critical role of adaptive mode decomposition and intelligent parameter optimization in processing complex temporal signals. The ablation study of the ASSA module reveals its design value: the SSA and DSA branches are not redundant structures but capture global dependencies and local fine-grained features, respectively, and their synergy yields improvements beyond the contribution of either branch alone, demonstrating the strength of the adaptive sparse attention mechanism in balancing global and local feature modeling. Comparisons with traditional attention mechanisms further confirm this: purely sparse attention loses critical information, and Self-Attention and Cross-Attention trade one metric against another, whereas ASSA achieves a balanced optimum across all performance indicators through dynamic integration. The introduction of ContraNorm is equally crucial. Its contrastive-learning-based normalization not only outperforms classical normalization methods on overall metrics but, more importantly, enhances feature discriminability and inter-class separability, significantly improving training stability and generalization. Together, these results indicate that ASSAFormer, through the organic integration of WOA-VMD preprocessing, ASSA feature modeling, and ContraNorm normalization, constitutes an efficient and robust human activity recognition framework in which each component plays an indispensable role and whose synergy is the key to its superior performance.
Conclusion
This paper addresses the limitations in accuracy and generalization caused by sensor noise and data-distribution variability in human activity recognition tasks. The proposed ASSAFormer integrates mode decomposition, heuristic optimization algorithms, and an improved Transformer architecture, achieving its performance gains through three core innovations. First, WOA-optimized VMD effectively overcomes the limitations of traditional denoising methods and provides high-quality inputs for subsequent feature modeling. Second, the proposed ASSA integrates sparse and dense branches: the SSA branch filters low-correlation interference, while the DSA branch preserves weakly relevant yet useful fine-grained features, unifying global dependencies with local patterns and resolving the trade-off between information selectivity and completeness in traditional attention. Finally, the introduced ContraNorm incorporates a contrastive learning mechanism that enhances feature discriminability and alleviates the dimensional collapse problem, significantly improving training stability and generalization. Comparative experiments on the UCI and URFD benchmark datasets demonstrate that ASSAFormer achieves the best performance across all metrics, and systematic ablation studies validate both the effectiveness of the individual modules and the necessity of their synergy. This research provides an efficient, robust, and practically valuable solution for wearable sensor-based health monitoring.
Future work will explore several directions: investigating the model's generalization in more diverse scenarios and on larger-scale datasets, and pursuing more lightweight designs to meet deployment requirements on resource-constrained edge devices. These efforts will further advance the practical application of intelligent health monitoring technologies.
Author contributions
J.Z. conceived the study and designed the methodology. J.Z. wrote the main manuscript text. B.P. collected and analyzed the data. B.P. contributed to the interpretation of the findings. C.C. supervised the entire project and provided critical revisions to the manuscript. All authors reviewed and approved the final manuscript.
Funding
This research was funded by the 2023 National Social Science Fund General Project (23BTY078) and the General Research Project of Teaching Reform in Hunan Provincial Colleges and Universities (202502001698).
Data availability
The datasets analysed during the current study are publicly available in the UCI Machine Learning Repository under the identifier 'Human Activity Recognition Using Smartphones' at https://archive.ics.uci.edu/dataset/240/human+activity+recognition+using+smartphones and in the UR Fall Detection Dataset repository at https://github.com/ckm-cug/ASSAFormer-URFD, with both sources referenced in the Datasets Section.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.