Scientific Reports. 2025 Sep 29;15:33389. doi: 10.1038/s41598-025-18444-6

Adaptive temporal attention mechanism and hybrid deep CNN model for wearable sensor-based human activity recognition

Zhixue Wang 1, Kai Kang 1
PMCID: PMC12480449  PMID: 41022984

Abstract

The recognition of human activity by wearable sensors has garnered significant interest owing to its extensive applications in health, sports, and surveillance systems. This paper presents a novel hybrid deep learning model, termed CNNd-TAm, for the recognition of both simple and complex activities. The proposed approach enhances spatial feature extraction and long-term temporal dependency modeling by integrating dilated convolutional networks with a modified temporal attention mechanism. Data from accelerometer and gyroscope sensors in the UTwente dataset, encompassing 13 activities and 10 participants, underwent preprocessing that included filtering, normalization, and the selection of time windows according to the activity type. Experimental findings demonstrate an accuracy of 99.4% in identifying complex activities, such as conversing and consuming coffee, surpassing earlier hybrid deep learning models. This model represents a significant advancement in the development of efficient Human Activity Recognition systems by addressing deficiencies in the recognition of complex activities.

Keywords: Human activity recognition, Wearable sensors, Hybrid deep learning, Dilated CNN, Temporal attention, Complex activities

Subject terms: Engineering, Materials science, Mathematics and computing

Introduction

Human activity recognition is an essential and expanding domain within machine learning, machine vision, and digital health, utilizing data from wearable sensors, including accelerometers, gyroscopes, and magnetometers, to identify and categorize daily human activities [1, 2, 12]. This technology has extensive applications in health monitoring, sports tracking, intelligent human–machine interactions, and security systems. Recent advancements in sensor technology and enhanced computational capabilities have facilitated the collection and processing of more intricate data [35]. In particular, the emergence of low-power, high-precision sensors in smart and wearable devices has enabled the acquisition and analysis of multimodal and high-dimensional sensor data [2, 5]. Nonetheless, barriers such as variability in movement patterns, data noise, missing values, and the need to precisely characterize temporal and spatial dependencies continue to impede the attainment of high accuracy in this domain [6, 7]. These problems are especially evident in real-world settings, where sensor placements (e.g., trouser pockets, wrists) generate noise and inconsistency, as noted in the UTwente dataset [8]. Human activities are generally classified as simple activities (e.g., walking, sitting, jogging) and complex activities (e.g., typing, consuming coffee, conversing) [8]. Simple activities demonstrate consistent and repeated patterns, facilitating their detection, whereas complex activities entail non-repetitive and context-sensitive motions that depend on extended temporal sequences, presenting considerable challenges for detection [7, 8]. The UTwente dataset, comprising 13 activities (7 simple and 6 complex), underscores practical issues such as noise from vibrations and missing values resulting from sensor placement in real-world scenarios [8].
Resolving these challenges necessitates comprehensive preprocessing methods, such as noise reduction, data imputation, and selection of activity-specific time windows, to guarantee dependable feature extraction and classification [6, 7, 30]. Deep learning has transformed Human Activity Recognition (HAR) by facilitating the automatic extraction of hierarchical features from raw sensor data and addressing the constraints of conventional hand-crafted feature engineering [1, 5, 12]. Convolutional neural networks proficiently capture spatial patterns, but recurrent neural networks, including long short-term memory (LSTM) and gated recurrent unit (GRU), excel in modeling temporal dependencies [6, 10]. Hybrid deep learning models that combine CNNs and RNNs have demonstrated potential in tackling both spatial and temporal dimensions of sensor data [2, 5, 12]. Nonetheless, these models frequently experience overfitting, elevated computational complexity, and inadequate emphasis on critical temporal aspects, particularly for intricate activities [6, 7, 12]. Moreover, the dependence on extensive, labeled datasets and the arduous task of annotating real world sensor data have stimulated interest in unsupervised and semi-supervised learning methodologies to alleviate the annotation burden [2, 4, 32].

This article presents a hybrid deep learning model, termed CNNd-TAm, which integrates a convolutional network for multi-scale spatial feature extraction with a modified temporal attention mechanism for effective temporal modeling. This model aims to enhance the accuracy of both simple and complicated activity recognition in the UTwente dataset by extracting spatial information from sensor data and prioritizing essential temporal features. The UTwente dataset was chosen as a suitable benchmark for evaluation because of its diverse range of activities (13 in total, comprising 7 simple and 6 complicated activities), superior data quality, and the application of wearable sensors in real-world contexts (trouser pockets and wrists) [8]. This paper offers a thorough strategy to address the issues in Human Activity Recognition (HAR) by highlighting precise data preparation, selecting suitable time windows for the specific activities, and optimizing meta-parameters. The main contributions of the research are summarized as follows.

  • Presenting a novel CNNd-TAm hybrid model that combines dilated convolutional networks (DCN) and a modified temporal attention mechanism to improve spatial feature extraction at different scales and modeling long-term temporal dependencies.

  • This research presents advanced methods for preprocessing accelerometer and gyroscope sensor data, including noise filtering, normalization, and selection of time intervals appropriate to the type of activity.

  • Focusing on extracting optimal spatial and temporal features, the proposed model overcomes the limitations of previous deep learning models, such as overfitting and high computational complexity, and provides superior performance in recognizing complex and non-repetitive activities compared to existing hybrid models.

The organization of the document is as follows: The second segment reviews pertinent literature in the topic of Human Activity Recognition (HAR). The third section delineates the suggested methodology, encompassing the introduction of the UTwente dataset, preprocessing procedures, CNNd-TAm model architecture, and implementation specifics. The fourth component focuses on analyzing experimental results and comparing the model’s performance with alternative methodologies. Ultimately, the discussion, conclusions, and recommendations for further research will be given. This project seeks to provide an efficient and generalizable model to enhance human activity recognition systems in both practical and research contexts.

Related works

Substantial progress in human activity recognition (HAR) has been facilitated by deep learning methodologies that analyze multiscale sensor data from wearable devices, such as accelerometers, gyroscopes, and magnetometers [1, 2, 5, 7]. These advancements have applications in healthcare, intelligent settings, sports analytics, and human–computer interaction, necessitating robust models for the recognition of both simple and complicated activities [3, 4, 8, 25]. This section offers a detailed examination of HAR methods, utilizing deep learning and recurrent model architectures, alongside hybrid models resulting from their integration. It also investigates the implementation of attention mechanisms, emphasizing preprocessing techniques, utilized datasets, and associated challenges. This paper contributes to extensive research that has culminated in the suggested CNNd-TAm model, which mitigates the challenges of computing efficiency and intricate activity detection.

Convolutional networks are extensively utilized in Human Activity Recognition (HAR) because of their capacity to extract local spatial and spatiotemporal information from sensor data and images [6, 11]. The application of CNNs to human activity identification using multiscale sensor data, smartphones, and radar was examined in [11, 29]. These studies showed that CNNs excel at extracting local features from spatiotemporal data but fall short in modeling long-term temporal correlations. To mitigate this constraint and tackle challenges such as overfitting and computational complexity, certain studies have combined CNNs with MaxPooling and Dropout layers [6]. The authors of [9] introduced a 2D CNN model for activity classification on the WISDM dataset, attaining an accuracy of approximately 96.1% using features derived from sensor data. Likewise, [14] examined lightweight convolutional neural networks for low-power embedded systems and attained good accuracy on PAMAP2. Nonetheless, these models showed limited efficacy in recognizing intricate activities characterized by non-repetitive patterns, attributable to insufficient emphasis on critical temporal aspects [5, 11]. This study’s proposed model addresses this limitation through dilated convolutional layers. These layers extract features at varying scales by enlarging the receptive field without augmenting processing complexity, which proves particularly effective for intricate activities such as conversing and consuming coffee.

Numerous studies have integrated CNNs with recurrent networks, including LSTM, GRU, BiLSTM, and BiGRU, to model temporal dependencies in HAR [10, 15, 16, 21], with remarkable outcomes.
A CNN-LSTM hybrid model was suggested in [17, 24] for activity recognition with wearable sensor data, effectively modeling long-term temporal dependencies with LSTM and attaining an accuracy of approximately 95.8% on the UCI-HAR dataset. Similarly, [6] created a CNN-GRU model validated on the WISDM dataset, integrating convolutional layers with gated recurrent units to surpass models like Inception Time and DeepConvLSTM [13]. [19] created a hierarchical CNN-BiLSTM model that extracts local spatiotemporal and global contextual characteristics, while [22] introduced a multi-branch CNN-BiLSTM model that reduces preprocessing and captures local and long-term dependencies, validated using wearable sensor data. A 1D-CNN-BiLSTM model proposed in [18] converts sensor data into high-level features and encodes long-range relationships using BiLSTM. [10] presented a CNN-BiGRU model with 98.89% accuracy on the UTwente dataset, demonstrating enhanced performance for intricate activities. The study in [20] conducted a comparative examination of CNN hybrids using LSTM, BiLSTM, GRU, and BiGRU, attaining high F-scores on the PAMAP2 dataset. CNN-RNN hybrids, while successful, frequently entail substantial computing expenses and, in the absence of attention mechanisms, struggle to differentiate similar activities (e.g., smoking from drinking) [10, 15], which constrains their use on low-power devices. The CNNd-TAm model instead employs an enhanced temporal attention mechanism that prioritizes activity-related data while discarding superfluous information, improving the recognition of complex activities by concentrating on essential temporal properties.

Attention mechanisms have garnered significant interest in Human Activity Recognition (HAR) in recent years because of their capacity to prioritize salient aspects and exclude extraneous information. In [24], a CNN-LSTM model incorporating self-attention was validated on MHEALTH and UCI-HAR, enhancing prediction by prioritizing essential temporal patterns. [23] modified a Transformer model for Human Activity Recognition (HAR) and attained 99.2% accuracy on an extensive smartphone sensor dataset by utilizing self-attention to address signal dependencies inside the model. Nonetheless, Transformer-based models are computationally demanding, rendering them less appropriate for resource-limited wearable devices [7, 23]. The hybrid learning algorithms (CMFA and CGFA) shown in [25], utilizing analogous attention mechanisms, exhibited resilience to both local and long-term dependencies within the WISDM dataset. These investigations underscore the capacity of attention mechanisms to enhance model interpretability and performance; nonetheless, their computational burden persists as a concern. The CNNd-TAm model incorporates a streamlined temporal attention mechanism aimed at optimizing efficiency and accuracy for practical human activity recognition applications.

In [36], a unique technique for HAMR is developed that exploits the attention mechanism with multi-head convolutional neural networks (CNN) and long-short-term memory (LSTM). The accuracy of activity detection in the proposed method is increased by integrating attention into multi-head CNNs followed by LSTM for better feature extraction and selection. However, adopting an activity detection model in practical scenarios confronts two fundamental hurdles. First, machine learning models utilize a huge amount of labeled data to recognize human activities, which is not always achievable in real-world circumstances. Second, present human activity detection (HAR) systems cannot dynamically adjust to a new action. Furthermore, present approaches fail to distinguish short-term activities from heterogeneous smart devices with diverse placements and orientations that have comparable sensory reading patterns. To solve these challenges, the Flexi-HAMR technique is proposed in [37], an intelligent adaptive human activity monitoring and detection system that dynamically recognizes activities utilizing online and real-time activity data.

These studies collectively highlight deficiencies in complicated activity recognition, computing efficiency, and practical applicability. The suggested CNNd-TAm model is enhanced through the incorporation of a modified temporal attention mechanism, integrating Self-Attention and General Attention, to diminish computing complexity while preserving high accuracy. This approach enables the model to more efficiently capture long-term temporal connections, particularly in intricate tasks necessitating the study of extended motion sequences. Table 1 contrasts the CNNd-TAm architecture with analogous models, emphasizing key improvements such as Dilated convolutions, BiGRU, and a parallel dual attention mechanism that enhances accuracy and efficiency for intricate tasks.

Table 1. Comparison of CNNd-TAm with related HAR architectures.

| Model | Reference | CNN type | RNN type | Attention mechanism | Target challenge | Limitations |
|---|---|---|---|---|---|---|
| CNN-LSTM | [24] | Standard 1D CNN | LSTM | Sequential self-attention | Long-term dependencies | High computational cost, limited multi-scale feature extraction |
| CNN-GRU | [13] | Standard 1D CNN | GRU | None | Temporal modeling | Limited focus on critical temporal frames |
| CNN-BiLSTM | [18] | Standard 1D CNN | BiLSTM | None | Local and global dependencies | Ineffective for non-repetitive patterns |
| CNN-BiGRU | [10] | Standard 1D CNN | BiGRU | None | Complex activity recognition | Limited differentiation of similar activities (e.g., drinking vs. smoking) |
| Transformer-based | [23] | None | None | Transformer self-attention | Long-range dependencies | Computationally intensive, unsuitable for wearables |
| CNNd-TAm (Proposed) | This study | Dilated CNN (rates: 1, 2, 4) | BiGRU (128 units) | Parallel self-attention + general attention | Complex activity recognition, similar activity differentiation | Higher complexity than simple CNNs |

Proposed approach

Dataset description

  • The UTwente dataset consists of 13 activities gathered from 10 people utilizing two smartphones (one in the right trouser pocket and the other on the wrist), equipped with accelerometer, gyroscope, linear acceleration, and magnetometer sensors at a frequency of 50 Hz.

  • The UCI-HAR dataset comprises six fundamental activities (walking, walking upstairs, walking downstairs, sitting, standing, lying) performed by 30 people utilizing a waist-mounted smartphone equipped with accelerometer and gyroscope sensors at a frequency of 50 Hz. This dataset serves as a typical benchmark for Human Activity Recognition (HAR) owing to its equitable distribution of activity classes and extensive participant pool.

  • The OPPORTUNITY dataset comprises 17 activities, including 5 locomotion and 12 gesture-based activities, recorded from 4 people using various wearable sensors (accelerometer, gyroscope, and inertial measurement units) at a frequency of 30 Hz. It is optimal for assessing intricate actions in authentic environments.

Preprocessing

A uniform preprocessing pipeline was applied across all datasets to ensure consistency and reliability. To mitigate high-frequency noise, a third-order Butterworth low-pass filter with a cutoff frequency of 20 Hz was employed. Missing values were imputed using linear interpolation to preserve the continuity of time-series data. Min–Max normalization was subsequently applied to scale all data within the range of [0, 1]. For temporal segmentation, activity-specific window sizes with 50% overlap were utilized: 20 s for simple, 40 s for complex, and 30 s for mixed activities in the UTwente dataset; a fixed 20 s window for all activities in the UCI-HAR dataset due to their low complexity; and a 40 s window for the OPPORTUNITY dataset to effectively capture complex, gesture-based movements. Finally, a tenfold cross-validation strategy was adopted to ensure robust and generalizable performance evaluation.
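The interpolation, normalization, and segmentation steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single-channel signal held in a Python list with missing samples marked as `None`, the function names are hypothetical, and the Butterworth filtering step is omitted because it would normally rely on a signal-processing library.

```python
def interpolate_missing(signal):
    """Fill None gaps by linear interpolation between the nearest valid samples."""
    out = list(signal)
    known = [i for i, v in enumerate(out) if v is not None]
    if not known:
        raise ValueError("signal has no valid samples")
    for i in range(len(out)):
        if out[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # leading gap: copy the first valid value
            out[i] = signal[right]
        elif right is None:     # trailing gap: copy the last valid value
            out[i] = signal[left]
        else:
            w = (i - left) / (right - left)
            out[i] = signal[left] + w * (signal[right] - signal[left])
    return out


def min_max_normalize(signal):
    """Scale values into [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]
    return [(v - lo) / (hi - lo) for v in signal]


def sliding_windows(signal, window_seconds, sampling_hz=50, overlap=0.5):
    """Cut the signal into fixed-length windows with fractional overlap."""
    size = int(window_seconds * sampling_hz)
    step = max(1, int(size * (1.0 - overlap)))
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, step)]
```

With a 50 Hz signal, a 20 s window holds 1000 samples and a 50% overlap advances the window by 500 samples, so 60 s of data yields five windows.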

The data samples are partitioned into training and test sets using cross-validation, which improves the reliability of the evaluation and reduces dependence on a single arbitrary split. In k-fold cross-validation, the data are divided into k segments (folds); here k is set to 10. In each iteration, one fold serves as the test set while the remaining folds constitute the training set. This procedure is executed k times, and the final results are obtained by averaging the model’s performance over all iterations [7, 26].
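The fold rotation can be illustrated with a short sketch. It assumes samples are addressed by integer index and that `evaluate` is a user-supplied callback (a hypothetical name) that trains on the training indices and returns a score on the test indices; shuffling and subject-wise grouping are left out for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```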

The primary objective of this phase is to assess the learning algorithm’s capacity to generalize to unseen data. This matters in human activity recognition (HAR) because the model must accurately recognize diverse activities under varying conditions.

Alongside cross-validation, the choice of temporal window size and data preprocessing techniques significantly influences the model’s final performance. The window must be long enough to capture sufficient signal for distinguishing the various actions, yet short enough to limit noise and superfluous information. Furthermore, data standardization or normalization can enhance input quality and expedite model training [10, 31]. Figure 1 presents a summary of the proposed methodology.

Fig. 1. The proposed framework of S-HAR for CHA recognition.

Proposed model architecture

The CNNd-TAm hybrid model integrates a one-dimensional dilated convolutional architecture (Dilated CNN-1D) with an enhanced temporal attention mechanism (Temporal Attention). Figure 2 illustrates the internal architecture of the proposed model. The network comprises thirteen layers in total, and each layer processes the output of the preceding one. This section describes the internal architecture and analyzes each layer of the hybrid deep network.

Fig. 2. Architecture of the proposed CNNd-TAm model.

Convolutional layers

The convolutional block comprises three one-dimensional CNN layers that extract spatial and local information from the input sensor data; the preprocessed data are fed into these layers first. The architecture employs dilated convolutional layers (CNNd), which enlarge the filter’s receptive field without increasing the parameter count: the number of filter weights stays the same, and gaps are inserted between them. A dilated convolution therefore covers a wider span of the input at equivalent computing expense, whereas obtaining the same coverage by enlarging the filter would increase both the number of parameters and the computational complexity. A dilated convolutional layer has a dilation rate d, meaning that d-1 input samples are skipped between adjacent filter elements. The first layer is a 1D convolutional layer with 128 filters and a dilation rate of 1, which is equivalent to standard convolution. Every CNN layer is followed by a max pooling layer that downsamples the feature maps. Pooling serves two objectives: reducing the number of parameters while retaining the salient features, and suppressing the extraneous noise generated by involuntary vibrations of the human body. Two further 1D dilated convolutional layers follow, with 64 and 32 filters and dilation rates of 2 and 4, respectively. The output of the third dilated convolutional layer passes through a Dropout layer before its max pooling layer to mitigate overfitting. After the dilated CNN layers, a Flatten layer transforms the output into a one-dimensional vector that serves as input to the Temporal Attention layer.
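To make the effect of the dilation rate concrete, the sketch below implements a single-channel dilated convolution and computes the receptive field of a stack of such layers. The kernel size of 3 used in the usage note is an assumption for illustration only, since the text does not state the kernel size, and pooling is ignored in the receptive-field calculation.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D dilated convolution in cross-correlation form."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions (pooling ignored)."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field
```

Under the assumed kernel size of 3, `receptive_field(3, [1, 2, 4])` gives 15 input samples, compared with 7 for three standard convolutions (`receptive_field(3, [1, 1, 1])`), while both stacks use the same number of weights per filter.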

A GELU activation function is used in the Conv1D layers. Unlike ReLU, GELU is nonlinear in the positive domain and has curvature at every point, which may enable it to approximate intricate functions more effectively than ReLU or ELU [33]. Additionally, ReLU gates its input based on sign alone, whereas GELU weights its input by its value, so small negative inputs yield small nonzero outputs instead of being set to zero. The activation function is defined as follows:

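$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] \tag{1}$$
$$\mathrm{GELU}(x) \approx \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\,\left(x + 0.044715\,x^{3}\right)\right)\right] \tag{2}$$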
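For reference, the standard GELU definition and its common tanh approximation can be compared numerically with ReLU. This is a sketch of the textbook formulas; whether the authors used the exact or the approximate form is not stated in the text.

```python
import math


def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def relu(x):
    """ReLU gates purely on sign."""
    return max(0.0, x)
```

For a small negative input such as -0.5, `relu` returns zero while `gelu_exact` returns a small negative value, which is the behaviour the paragraph above contrasts.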

Temporal attention layer

Sensor data typically has extensive information, much of which may be irrelevant for identifying human activities; for instance, an individual may walk for an extended duration, although only segments of the sensor data throughout this period possess critical properties for activity recognition. The remaining data comprises superfluous features that may induce computational burden and diminish the model’s accuracy. The Attention layer is explicitly engineered to emphasize the more significant elements of the data. This layer in the proposed model architecture serves to prioritize the most significant characteristics from the CNN output [28]. This enables the modeling of temporal dependencies based on the more significant traits. The Attention technique enables the model to concentrate more effectively on sparsely represented or intricate classes.

The selection of attention type is contingent upon the data structure, the model’s objective, and the interrelations among the characteristics. Given that sensor data is structured as a time series, the use of temporal attention can prove highly successful [34]. It enables the model to concentrate on particular time intervals of an activity that hold greater significance. It enhances the model’s accuracy by enabling it to learn more from significant temporal features. This may enhance predictive accuracy.

This attention structure [34] comprises two mechanisms, Self-Attention and General Attention, in conjunction with a long short-term memory (LSTM) network for time series modeling. This framework enhances prediction accuracy and facilitates the interpretation of model outcomes. The primary elements of its architecture are illustrated in Fig. 3a. In the modified architecture, the LSTM is replaced with a BiGRU, as detailed in the structural description below and illustrated in Fig. 3b.

Fig. 3. (a) Temporal Attention architecture. (b) Modified Temporal Attention architecture.

In light of the modifications to the structure of the Temporal Attention model utilizing the BiGRU recurrent layer, our proposed revised structure (Fig. 3b) comprises four essential components as outlined below:

Input layer: responsible for receiving time series data from the output of the convolutional layers for processing in subsequent layers.

The BiGRU layer: comprising two gated recurrent units (GRU, Fig. 4a) that process data in the forward and backward directions (Fig. 4b), this layer captures long-term temporal dependencies from the output of the convolutional layers. Compared with LSTM, the lighter GRU architecture, which uses only update and reset gates, allows fast processing of time-series sensor input while capturing sequential patterns in actions such as typing or speaking. The hidden state at time t combines the present input with the preceding (forward) or subsequent (backward) state, guaranteeing effective feature extraction for the ensuing dual attention mechanism. At each time step t, the BiGRU network generates an output vector $h_t$ that synthesizes information from both directions. These interactions are expressed mathematically as:

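$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{3}$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{4}$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \tag{5}$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{6}$$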
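A bidirectional GRU pass can be sketched as below, assuming the standard GRU formulation with update gate z, reset gate r, and a candidate state; the paper's exact gate convention is not recoverable from the text, so this is one common variant. The parameter dictionary layout and the `constant_params` helper are hypothetical and exist only to make the example runnable.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def affine(w, x, u, h, b):
    """Compute W x + U h + b element by element."""
    return [a + c + d for a, c, d in zip(matvec(w, x), matvec(u, h), b)]


def gru_step(x, h_prev, p):
    """One GRU update (assumed standard form)."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(sequence, p, hidden_size):
    """Run a GRU over the sequence from a zero initial state."""
    h, states = [0.0] * hidden_size, []
    for x in sequence:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(sequence, fwd, bwd, hidden_size):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_gru(sequence, fwd, hidden_size)
    backward = run_gru(sequence[::-1], bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def constant_params(input_size, hidden_size, value):
    """Hypothetical helper: every weight equals `value`, every bias is zero."""
    w = [[value] * input_size for _ in range(hidden_size)]
    u = [[value] * hidden_size for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    return {"Wz": w, "Uz": u, "bz": b, "Wr": w, "Ur": u, "br": b,
            "Wh": w, "Uh": u, "bh": b}
```

Each output vector has twice the hidden size because the forward and backward states are concatenated, which is why the attention layers that follow see information from both directions.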
Fig. 4. (a) GRU recurrent network architecture. (b) Bidirectional GRU.

Two parallel Attention layers: In contrast to RNNs, which primarily influence proximate elements, Self-Attention is capable of modeling long-range dependencies, selectively enhancing significant input while disregarding extraneous information. Certain segments of the sensor data hold greater significance than others, such as the apex of movement during an activity. Self-Attention discerns these points and assigns them greater significance. Self-Attention discerns and characterizes the interrelations among various segments of the temporal sequence. This layer can identify significant temporal points that are pivotal in the analysis of human activities.

Procedural phases in Self-Attention: As illustrated in Fig. 5a, each time frame in the input sequence is transformed into three vectors: Q, K, and V. These vectors quantify how strongly the information at one point correlates with that at other points. The inner product of Q and K is used to compute the similarity between time frames, which determines how much attention each frame allocates to the other frames. The following equation shows how these vectors are processed, as depicted in the figure.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{7}$$
Fig. 5. (a) Self-Attention network architecture. (b) General-Attention network architecture.

In this equation, Q specifies the information sought by each time frame, whereas K identifies the information each time frame offers, against which the queries are matched. The SoftMax output reflects the attention each frame allocates to the other frames and is multiplied by V, which carries the feature vector $h_t$ of each input, so the result is a weighted combination that dictates the information relayed to subsequent layers. Q, K, and V are each computed from the BiGRU output $H = [h_1, \ldots, h_T]^{\top}$, which serves as the input to the Self-Attention mechanism, as follows.

$$Q = H W_Q \tag{8}$$
$$K = H W_K \tag{9}$$
$$V = H W_V \tag{10}$$
$$S = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{11}$$

d represents the dimension of the feature vector in the attention space, or Attention Dimension. This value is usually equal to the number of features in each vector in the vector matrices Q, K, V, which is equal to the dimension of the BiGRU output, and is placed in the denominator of the fraction to maintain gradient stability during training.
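Scaled dot-product self-attention over a sequence of BiGRU outputs can be sketched as follows. The sketch assumes the standard formulation in which Q, K, and V are linear projections of the same input and the scores are divided by the square root of d; the projection matrices `w_q`, `w_k`, and `w_v` are illustrative names.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def self_attention(h, w_q, w_k, w_v):
    """Return (output, weights) for a sequence h of shape T x features."""
    q, k, v = matmul(h, w_q), matmul(h, w_k), matmul(h, w_v)
    d = len(q[0])  # attention dimension used for scaling
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights
```

Row t of `weights` shows how much frame t attends to every other frame, which is the quantity the SoftMax output in the text describes.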

General Attention: In contrast to Self-Attention, which discerns the internal relationships across time frames, General Attention emphasizes the overall significance of each segment of the sequence. This layer identifies the segments of the temporal sequence that exert the greatest influence on the ultimate forecast. It allocates more significance to essential components while diminishing the importance of other elements and detects the locations that most significantly influence the output, irrespective of the temporal relationships between frames. It identifies the segments of the sequence that the model prioritizes and the components of the data that influence decision-making.

Procedural phases in General Attention: As Fig. 5b illustrates, a two-layer MLP network is first used to compute the attention score, which assesses the significance of each time frame in the sequence and thereby enables the selection of crucial characteristics for the final output.

$$e_t = W_2 \tanh\left(W_1 h_t + b_1\right) + b_2 \tag{12}$$

In this equation, $h_t$ represents the output of the BiGRU recurrent network at time frame t, $W_1$ is the weight of the first layer, and $b_1$ signifies its bias. The output of the first layer is passed through a hyperbolic tangent activation function (tanh) to model nonlinear dependencies and then fed to the second layer, where $W_2$ and $b_2$ represent the weight and bias, respectively. To determine the relative significance of each time frame in relation to the others, the attention scores of all frames are processed through Softmax, yielding $\alpha_t$, the attention coefficient for time frame t, with the attention coefficients summing to 1.

$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)} \tag{13}$$

In the subsequent phase, the features are combined with the attention coefficients, amplifying the salient frames and attenuating the less significant ones:

$$A = \sum_{t=1}^{T} \alpha_t h_t \tag{14}$$

where A represents a vector that serves as a weighted summary of the complete temporal sequence and encompasses critical information.
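The scoring, normalization, and weighted-sum steps of General Attention can be sketched as below. This assumes the two-layer MLP produces one scalar score per time frame, with tanh applied to the first layer's output; the parameter names `w1`, `b1`, `w2`, and `b2` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def general_attention(h, w1, b1, w2, b2):
    """Return (context, alphas) for a sequence h of shape T x features."""
    scores = []
    for h_t in h:
        # First layer followed by tanh, then a scalar second layer.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h_t)) + b)
                  for row, b in zip(w1, b1)]
        scores.append(sum(w * u for w, u in zip(w2, hidden)) + b2)
    alphas = softmax(scores)  # attention coefficients, summing to 1
    dim = len(h[0])
    context = [sum(a * h_t[j] for a, h_t in zip(alphas, h)) for j in range(dim)]
    return context, alphas
```

When every frame receives the same score, the coefficients are uniform and the context vector reduces to the plain average of the frames; a frame with a much higher score dominates the summary.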

Attention Output Connection Layer: Self-Attention identifies dependencies among time frames and extracts significant aspects in temporal relationships, while General Attention assesses the importance of frames in the sequence, ensuring that critical ones are not overlooked.

The connection layer must generate a final vector that incorporates information from both mechanisms. The outputs of the two Attention mechanisms are retained as distinct vectors, which are then concatenated into a unified vector and transmitted to the final output layers. Each output has the same dimension as the recurrent layer’s output, so the combined vector has twice that dimension.

Final output layers

At this stage, following the combination of the outputs from both Attention mechanisms, the final feature vector is large. To reduce its dimensionality and the associated computational cost, a dense layer is applied whose activation function σ is GELU, which also alleviates the vanishing gradient issue within the network and facilitates faster convergence. Writing c for the combined attention vector and $W_d$, $b_d$ for the weight and bias of the dense layer:

$$o = \sigma\left(W_d\, c + b_d\right) \tag{15}$$

A further problem with deep neural networks is overfitting. A Dropout layer is employed to address this issue. A percentage of the network’s neurons are randomly silenced to enhance the model’s generalizability to novel data. The last layer employs the SoftMax function to transform the model output into a probability distribution across activity classes, with each value representing the likelihood of the input corresponding to a particular class of activities. The formula for the expression is as follows:

graphic file with name d33e888.gif 16

Human Activity Recognition (HAR) on the UTwente dataset is a multi-class classification problem, for which the Cross-Entropy Loss function, abbreviated as CEL, is appropriate.

$$\mathrm{CEL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log\left(\hat{y}_{ij}\right) \tag{17}$$

In Eq. 17, N represents the number of samples, C denotes the number of classes, $y_{ij}$ is the true label for sample i and class j, and $\hat{y}_{ij}$ is the predicted probability for sample i in class j, determined by the model output via the SoftMax function.
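As a small worked illustration of the SoftMax output and the cross-entropy loss described above, the following sketch computes both in plain Python. It assumes one-hot true labels and averages the loss over samples; the logits are made-up values, and the helper names `softmax` and `cross_entropy` are chosen here for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over N samples (one-hot y_true)."""
    n = len(y_true)
    loss = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        loss -= sum(t * math.log(p + eps) for t, p in zip(true_row, pred_row))
    return loss / n

# Hypothetical logits for two samples and three activity classes.
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
y_true = [[1, 0, 0], [0, 1, 0]]

y_pred = [softmax(row) for row in logits]
loss = cross_entropy(y_true, y_pred)
print(y_pred, loss)
```

The loss shrinks toward zero as the predicted probability of the true class approaches 1, which is why it suits multi-class training.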

The CNNd-TAm architecture’s innovation and unique features

Innovations in architecture

This portion emphasizes the distinctive contribution of CNNd-TAm relative to current hybrid architectures for human activity recognition (HAR), particularly in tackling the complexities of intricate activities. Previous studies, including [10, 18, 23, 24], have amalgamated convolutional neural networks (CNN), recurrent neural networks (RNN), and attention mechanisms; however, CNNd-TAm presents a specialized integration of Dilated convolutions, bidirectional gated recurrent units (BiGRU), and an enhanced temporal attention mechanism that significantly enhances performance for intricate activities such as typing, drinking coffee, and smoking within the UTwente and OPPORTUNITY datasets encompassing these activities.

The CNNd-TAm model comprises three essential components, each featuring distinct enhancements.

  • Dilated Convolutional Layers with Adaptive Dilation Rates: In contrast to conventional CNNs referenced in [9, 14], which utilize fixed kernel sizes, CNNd-TAm implements Dilated convolutions featuring progressively increasing dilation rates (1, 2, 4) across three layers comprising 128, 64, and 32 filters, respectively. This design enhances the receptive field to capture multi-scale spatial patterns while maintaining computational efficiency, effectively addressing the challenge of modeling non-repetitive and variable motion patterns in complex activities, such as differentiating hand movements in typing and writing. In contrast, models such as CNN-LSTM [24] and CNN-BiLSTM [18] utilize standard convolutions, which restrict their capacity to effectively capture a range of spatial scales.

  • BiGRU for Bidirectional Temporal Modeling: Previous hybrid models, including CNN-BiGRU [10] and CNN-BiLSTM [18], utilize bidirectional recurrent layers; in contrast, CNNd-TAm implements a simplified BiGRU with 128 units specifically optimized for human activity recognition (HAR). The architecture of BiGRU is simpler than that of BiLSTM, resulting in reduced computational overhead while effectively modeling long-term dependencies, which is essential for complex activities involving extended temporal sequences, such as conversational gestures. This differs from [24], in which LSTM-based models entail greater computational expenses.

  • Modified Temporal Attention Mechanism: The primary innovation of CNNd-TAm is its modified temporal attention mechanism, which integrates Self-Attention and General Attention in a parallel arrangement, succeeded by an additive connection layer. In contrast to the self-attention mechanism in [24], which emphasizes intra-sequence dependencies, and the hierarchical attention in [19], which organizes global context hierarchically, CNNd-TAm employs a dual attention strategy that concurrently addresses local temporal relationships through Self-Attention and global sequence significance through General Attention. This hybrid attention mechanism is designed to tackle the challenge of complex activities by prioritizing critical temporal frames, such as peak hand movements during drinking, while preserving context throughout the entire sequence, including conversational patterns.
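To make the receptive-field claim for the dilated layers concrete, the sketch below computes the receptive field of a stack of 1D convolutions. It is a simplified calculation that assumes stride 1 and ignores the pooling layers; the kernel sizes (3, 5, 7) are taken from Table 2, and the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_sizes, dilation_rates):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions."""
    rf = 1
    for k, d in zip(kernel_sizes, dilation_rates):
        # Each layer widens the field by (kernel - 1) * dilation steps.
        rf += (k - 1) * d
    return rf

kernels = [3, 5, 7]  # kernel sizes listed in Table 2
dilated = receptive_field(kernels, [1, 2, 4])
standard = receptive_field(kernels, [1, 1, 1])
print(dilated, standard)  # prints: 35 13
```

Under these assumptions, the dilation rates 1, 2, 4 widen the receptive field from 13 to 35 time steps without adding any weights, which is the efficiency argument made above.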

Particular improvements in the modified temporal attention mechanism

The modified temporal attention mechanism in CNNd-TAm presents distinct characteristics compared to existing approaches in several aspects:

Parallel Self-Attention and General Attention: In contrast to [24], which applies self-attention sequentially following LSTM, CNNd-TAm executes Self-Attention and General Attention concurrently, facilitating the simultaneous modeling of local (e.g., specific motion peaks) and global (e.g., overall activity context) temporal features. The Self-Attention component utilizes scaled dot-product attention, as described in [23], to derive Query (Q), Key (K), and Value (V) vectors from BiGRU outputs. This process identifies inter-frame dependencies and incorporates a scaling factor $\sqrt{d}$, where $d = 128$, to stabilize gradients. The General Attention component employs a two-layer MLP with Tanh activation to allocate importance scores to each frame, highlighting globally significant segments. This parallel design minimizes redundancy and improves efficiency relative to the sequential attention in [24] or the resource-intensive Transformer-based attention in [23].

Additive Connection Layer: The outputs of Self-Attention and General Attention are combined additively to create a unified feature vector, resulting in a dimensionality increase from 128 to 256, thereby encapsulating both local and global temporal information. This differs from [19], in which hierarchical attention sequentially aggregates features, possibly resulting in the loss of fine-grained temporal details. The additive connection preserves nuanced patterns in complex activities, such as subtle hand movements in smoking compared to drinking.

Enhanced for Complex Activities: The dual attention mechanism is designed to tackle the challenges posed by complex activities characterized by non-repetitive and context-sensitive patterns. Activities such as drinking and smoking in the UTwente dataset exhibit analogous hand-to-mouth motions; however, they differ in their temporal dynamics, including the frequency and duration of movements. The Self-Attention component identifies local temporal variations, whereas General Attention highlights critical frames (e.g., the action of lifting a cup), thereby minimizing confusion among similar activities. This leads to lower error rates in the confusion matrices (Figs. 6 and 7) when compared to CNN-BiGRU [10] and CNN-LSTM [24].
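The scaled dot-product Self-Attention described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: it uses the input frames directly as Q, K, and V (the learned projections are omitted as a simplifying assumption), and the toy input has d = 2 rather than 128.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) with softmax(Q K^T / sqrt(d)) applied to V."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output frame is a weighted mix of all value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V))
                        for j in range(len(V[0]))])
    return outputs, weights

# Hypothetical toy sequence: three frames, two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(X, X, X)
print(weights[0])
```

Each row of `weights` sums to 1 and shows how strongly one frame attends to every other frame, which is the inter-frame dependency the text refers to.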

Fig. 8.

Comparison of (a) training accuracy and (b) training loss for Hybrid Deep learning models on the OPPORTUNITY dataset for complex activities over 200 epochs.

Handling the difficulties of complex activities

The CNNd-TAm model addresses key challenges in activity recognition by integrating Dilated convolutions to capture non-repetitive motion patterns, overcoming the limitations of conventional CNNs with fixed receptive fields [9, 14]. A BiGRU layer, combined with a dual attention mechanism, enables effective modeling of long-term dependencies, outperforming LSTM-based models in extended sequences. Moreover, replacing LSTM with BiGRU and using an optimized attention mechanism reduces computational cost, making CNNd-TAm more efficient than Transformer-based [23] and hierarchical models [19], and suitable for deployment on resource-constrained wearable devices without compromising accuracy.

Analysis of experiments and results

Experiment setup and evaluation criteria

Model implementation

The proposed hybrid model was implemented in Python 3.8 with TensorFlow 2.10 and its high-level Keras neural network API. Model training and classification ran on Google Colab, which offers free access to computing resources and to storage for the UTwente dataset on the Google platform. The Colab environment used an NVIDIA Tesla T4 GPU with 16 GB of RAM and CUDA support to speed up training.

Evaluation criteria

This study aims to identify human activities from accelerometer and gyroscope data. Accuracy alone is an inadequate metric for comparison with other human activity recognition models, as it merely reflects the ratio of correctly classified activities to the total activities in the test dataset.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{18}$$

Here TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Precision is the ratio of true positives to the sum of true positives and false positives.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{19}$$

Recall is the ratio of true positives to the sum of true positives and false negatives.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{20}$$

The F1 score is the harmonic mean of Precision and Recall. It therefore accounts for both false positives and false negatives, which makes it an appropriate measure for comparison.

$$\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{21}$$

In addition, because human activity recognition is a multi-class classification problem, the confusion matrix, a square k × k matrix for k classes, is used to give a detailed view of the results. It shows a classifier’s performance by tabulating actual against predicted classes [7, 10], and so offers a more thorough evaluation of the correct and erroneous classifications of supervised learning models.

graphic file with name d33e1087.gif 22

Consequently, this study employs five criteria: classification accuracy, precision, recall, F1 score from the classification report, and the confusion matrix to assess the models [5, 7, 11, 31].
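The criteria above can all be derived from a confusion matrix. The sketch below shows one way to do so in plain Python, assuming rows are true classes and columns are predicted classes; the 3 × 3 matrix is a made-up example, not data from this study, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(cm):
    """Return overall accuracy and per-class (precision, recall, F1)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    per_class = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp  # predicted c, truly other
        fn = sum(cm[c]) - tp                       # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class

# Hypothetical confusion matrix for three activity classes.
cm = [
    [48, 2, 0],
    [3, 45, 2],
    [0, 1, 49],
]
accuracy, per_class = classification_metrics(cm)
print(accuracy, per_class)
```

Reading the matrix this way also shows which class pairs are confused, which the single accuracy figure hides.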

Hyperparameter tuning

Hyperparameter adjustment is crucial for the development and optimization of the proposed model in activity identification. This article employed a hybrid approach of manual and network methods to determine the optimal hyperparameter values. Starting ranges were assessed for the essential hyperparameters, which were then adjusted to improve model accuracy and the stability of the learning process.

The learning rate was evaluated within the range of Inline graphic to Inline graphic to identify the maximum rate that would ensure a consistent reduction in loss, as observed through the loss function during training.

A learning rate of 0.001 with the Adam optimizer was found to be optimal, offering an effective balance between learning speed and stability. The batch size, which dictates the number of samples processed per iteration, was evaluated at 16, 32, 64, and 128; a batch size of 64 yielded the best accuracy. These choices minimized gradient fluctuations and avoided excess processing, so the model trained stably and used computational resources efficiently.

The hyperparameters of the Dilated convolutional layers were carefully optimized. The filter sizes for the first, second, and third layers were 32, 64, and 128, respectively, with dilation rates of 1, 2, and 4 to enlarge the receptive field without raising computational complexity. These values facilitated the extraction of intricate latent features from the UTwente sensor data. MaxPooling layers of size 2 were employed to reduce dimensionality and suppress noise, while a Dropout rate of 0.3 was applied in the third layer and the output layer to prevent overfitting.

In the Temporal Attention section, the BiGRU recurrent layer was configured with 128 units to model long-term dependencies. An attention dimension of 128 was used to balance representational capacity against computational complexity. GELU activation functions were used for the convolutional layers and Tanh for General Attention because of their capacity to model nonlinear latent relationships. The number of training epochs was set to 200, with tenfold cross-validation used to assess the model’s generalizability. Table 2 summarizes the hyperparameters of the CNNd-TAm network.

Table 2.

Summary of the meta-parameters for the CNNd-TAm networks proposed.

Layer or process Meta parameter Selected value
Convolutional layers Number of layers 3
CNN filters size 64, 128, 256
CNN kernel size 3, 5, 7
Dilation rates 1, 2, 4
Activation function GELU
Dropout rate 0.3
Temporal attention Attention type Self-Attention + General-Attention
Activation function general-attention Tanh
Dimensions of attention 128
Attention output function SoftMax
Recurrent unit BiGRU (128)
Output function Dense layer 128
Dropout rate 0.3
Loss function Cross-Entropy Categorical
Output function SoftMax
Training tuning Learning rate 0.001
Optimizer Adam
Batch size 64
Epochs 200
Validation method tenfold Cross Validation

Experiments and results

Comparative analysis with hybrid deep learning models

The consistent performance of the CNNd-TAm model across UTwente, UCI-HAR, and OPPORTUNITY highlights its generalizability. It attains high accuracies despite differences in activity types, sensor configurations, and sampling rates. The Dilated convolutional layers extract multi-scale spatial characteristics, whereas the temporal attention mechanism emphasizes essential temporal patterns, overcoming the shortcomings of baseline models such as CNN-LSTM and CNN-GRU, which have difficulty with complex activities on OPPORTUNITY (error rates 4–6%). The model’s robustness is further demonstrated by its low error rates (Tables 3–5).

Table 3.

Comparison of model performance for combined activities across datasets.

Model Dataset Accuracy (%) Precision (%) Recall (%) F1 Score (%)
CNNd-TAm (Proposed) UTwente 99.15 100.00 98.65 98.82
CNNd-TAm (Proposed) UCI-HAR 99.10 99.60 98.90 99.25
CNNd-TAm (Proposed) OPPORTUNITY 98.50 99.00 98.00 98.50
CNN-BiGRU [10] UTwente 98.75 100.00 97.35 98.46
CNN-BiGRU [10] UCI-HAR 98.30 98.90 97.80 98.35
CNN-BiGRU [10] OPPORTUNITY 97.70 98.20 97.00 97.60
CNN-BiLSTM [18] UTwente 97.73 98.05 93.12 95.28
CNN-BiLSTM [18] UCI-HAR 97.90 98.50 97.40 97.95
CNN-BiLSTM [18] OPPORTUNITY 97.00 97.60 96.70 97.15
CNN-GRU [13] UTwente 97.92 98.60 93.48 95.68
CNN-GRU [13] UCI-HAR 97.70 98.30 97.20 97.75
CNN-GRU [13] OPPORTUNITY 96.80 97.40 96.30 96.85
CNN-LSTM [24] UTwente 97.65 99.70 91.15 94.90
CNN-LSTM [24] UCI-HAR 97.40 98.10 96.80 97.45
CNN-LSTM [24] OPPORTUNITY 96.40 97.10 95.90 96.50
Transformer [23] UCI-HAR 98.60 99.20 98.10 98.65
Transformer [23] OPPORTUNITY 97.80 98.40 97.30 97.85
CMFA [25] UCI-HAR 98.00 98.70 97.50 98.10
CMFA [25] OPPORTUNITY 97.20 97.80 96.80 97.30

Table 5.

Comparison of model performance for simple activities (20 s window).

Model Dataset Accuracy (%) Precision (%) Recall (%) F1 Score (%)
CNNd-TAm (Proposed) UTwente 99.75 100.00 99.05 99.52
CNNd-TAm (Proposed) UCI-HAR 99.65 99.85 99.30 99.57
CNN-BiGRU [10] UTwente 99.40 100.00 98.45 99.18
CNN-BiGRU [10] UCI-HAR 99.20 99.60 98.80 99.20
CNN-BiLSTM [18] UTwente 99.82 100.00 99.28 99.62
CNN-BiLSTM [18] UCI-HAR 99.40 99.70 99.10 99.40
CNN-GRU [13] UTwente 99.18 100.00 97.60 98.74
CNN-GRU [13] UCI-HAR 99.00 99.40 98.60 99.00
CNN-LSTM [24] UTwente 98.70 99.10 95.40 97.00
CNN-LSTM [24] UCI-HAR 98.60 99.20 98.30 98.75
Transformer [23] UCI-HAR 99.30 99.60 99.00 99.30
CMFA [25] UCI-HAR 98.90 99.30 98.60 98.95

Experiment 1: combined activities

This experiment assesses all activities in UTwente (13 activities), UCI-HAR (6 activities), and OPPORTUNITY (17 activities) utilizing 30-s, 20-s, and 40-s windows, respectively, with a 50% overlap. The results are displayed in Table 3.
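The windowing with 50% overlap used in these experiments can be sketched as follows. The sampling rate of 50 Hz and the 3-minute recording length are assumptions made only for this example, and `sliding_windows` is a hypothetical helper rather than the authors' code.

```python
def sliding_windows(n_samples, window_s, fs_hz, overlap=0.5):
    """Return (start, end) sample indices of fixed-length overlapping windows."""
    win = int(round(window_s * fs_hz))
    step = int(round(win * (1.0 - overlap)))
    if win <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # Only full windows are kept; a trailing partial window is dropped.
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]

fs_hz = 50             # assumed sampling rate
n_samples = 180 * fs_hz  # assumed 3-minute recording

windows = sliding_windows(n_samples, window_s=30, fs_hz=fs_hz, overlap=0.5)
print(len(windows), windows[0], windows[-1])
```

With these assumed values, each 30-s window holds 1500 samples and consecutive windows start 750 samples apart, so every sample away from the edges appears in two windows.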

Experiment 2: complex activities

This experiment examines complex activities in UTwente (typing, writing, consuming coffee, conversing, smoking, eating) and OPPORTUNITY (12 gesture-based actions) using a 40-s window with a 50% overlap. UCI-HAR is excluded because it contains only simple activities. The results are presented in Table 4.

Table 4.

Comparison of model performance for complex activities (40 s window).

Model Dataset Accuracy (%) Precision (%) Recall (%) F1 Score (%)
CNNd-TAm (Proposed) UTwente 99.42 100.00 99.20 99.60
CNNd-TAm (Proposed) OPPORTUNITY 98.30 98.80 97.90 98.35
CNN-BiGRU [10] UTwente 98.87 100.00 98.93 99.46
CNN-BiGRU [10] OPPORTUNITY 97.50 98.10 96.90 97.50
CNN-BiLSTM [18] UTwente 98.85 100.00 98.88 99.44
CNN-BiLSTM [18] OPPORTUNITY 96.90 97.50 96.60 97.05
CNN-GRU [13] UTwente 97.62 98.05 98.85 98.44
CNN-GRU [13] OPPORTUNITY 96.60 97.20 96.00 96.60
CNN-LSTM [24] UTwente 98.48 98.95 98.81 98.88
CNN-LSTM [24] OPPORTUNITY 96.20 96.90 95.80 96.35
Transformer [23] OPPORTUNITY 97.40 98.00 96.80 97.40
CMFA [25] OPPORTUNITY 96.80 97.40 96.50 96.95

Experiment 3: simple activities

This experiment assesses the seven simple activities in UTwente and the six simple activities in UCI-HAR using a 20-s window with a 50% overlap. OPPORTUNITY is excluded because of its emphasis on complex activities. The results are presented in Table 5.

Confusion matrix analysis

This study evaluates the classification performance of CNNd-TAm relative to hybrid deep learning models on complex activities, with confusion matrices for the UTwente and OPPORTUNITY datasets shown in Figs. 6 and 7, respectively. The matrices focus on complex activities (e.g., typing, drinking, conversing) because these are the most demanding to classify. For UTwente, the CNNd-TAm confusion matrix shows negligible errors in complex activities such as conversing and consuming coffee (Fig. 6). The 40-s window captured prolonged movement sequences, while the attention mechanism extracted essential temporal information and reduced misclassification among activities with similar movements. CNN-BiGRU and CNN-BiLSTM performed well with few errors (2–4%), whereas CNN-GRU and CNN-LSTM showed higher error rates (5–8%). For OPPORTUNITY, CNNd-TAm achieves an accuracy of 98.30 ± 0.25%, with lower error rates in context-rich activities (e.g., drinking) than the 4–6% of the other models. The Dilated CNN and dual attention mechanism in CNNd-TAm contribute to this improved classification, as seen in the high diagonal values in the matrices.

Fig. 6.

Confusion matrices for complex activities on UTwente dataset.

Fig. 7.

Confusion matrices for complex activities on OPPORTUNITY dataset.

Comparative analysis with conventional and standard models

In response to the reviewer’s request for a comparison with traditional deep learning models, we assessed the proposed CNNd-TAm model against standard architectures, VGG-16 [39] and ResNet-50 [40], as well as baseline HAR-specific models (CNN-LSTM [24], CNN-GRU [13], CNN-BiLSTM [18], CNN-BiGRU [10]). This comparison underscores the performance enhancements of CNNd-TAm, specifically its capacity to collect multi-scale spatial characteristics and long-term temporal relationships essential for human activity recognition (HAR). To ensure clarity and brevity, results from all datasets (UTwente, UCI-HAR, OPPORTUNITY) and activity categories (mixed, simple, complex) are aggregated into a singular table, emphasizing essential measures.

Adaptation of VGG and ResNet for HAR
  • VGG-16: Modified from its 16-layer image classification framework by substituting 2D convolutions with 1D convolutions to analyze time-series sensor data (accelerometer and gyroscope). The model included 128 to 512 filters, max pooling, and fully connected layers. The intricate architecture heightens computational complexity and poses a danger of overfitting for smaller datasets such as UTwente and OPPORTUNITY.

  • ResNet-50: Modified with 1D convolutions and residual blocks (64–256 filters) to address vanishing gradients. A global average pooling layer and a dense output layer were employed for classification. ResNet-50 has greater robustness than VGG-16, however it is less tuned for temporal dependencies in human activity recognition (HAR).

  • Execution Specifications: Models were executed in Python 3.8 utilizing TensorFlow 2.10 and Keras on Google Colab with an NVIDIA Tesla T4 GPU (16 GB RAM, CUDA compatibility). Hyperparameters configured for CNNd-TAm include a learning rate of 0.001, a batch size of 64, a total of 200 epochs, the Adam optimizer, and tenfold cross-validation. Preprocessing (Butterworth filter, Min–Max normalization, linear interpolation) and windowing (20 s for basic activities, 40 s for complicated activities, 30 s for combined activities) adhered to Section “Evaluation criteria”.
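Two of the preprocessing steps named above, linear interpolation of missing values and Min–Max normalization, can be sketched in plain Python. The Butterworth filter is omitted because it would need a signal-processing library. The function names and the sample values are hypothetical, and copying the nearest valid value into gaps at the edges is an assumption of this sketch.

```python
def interpolate_missing(xs):
    """Fill None entries by linear interpolation between valid neighbours."""
    known = [i for i, v in enumerate(xs) if v is not None]
    if not known:
        raise ValueError("signal has no valid samples")
    out = list(xs)
    for i, v in enumerate(xs):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:       # gap at the start: copy first valid value
            out[i] = xs[right]
        elif right is None:    # gap at the end: copy last valid value
            out[i] = xs[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = xs[left] + frac * (xs[right] - xs[left])
    return out

def min_max_normalize(xs):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

# Hypothetical accelerometer channel with missing samples.
raw = [2.0, None, 4.0, None, None, 10.0]
filled = interpolate_missing(raw)
scaled = min_max_normalize(filled)
print(filled, scaled)
```

Interpolating before normalizing matters here: the minimum and maximum are then computed on a complete signal rather than on one with gaps.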

Comparative results

Table 6 demonstrates the performance of CNNd-TAm, VGG-16, ResNet-50, and baseline models across UTwente (13 activities), UCI-HAR (6 simple activities), and OPPORTUNITY (17 activities, predominantly complicated) for combined, simple, and complex activities. Metrics include accuracy, precision, recall, and F1 score, with activity-specific windowing as indicated.

Table 6.

Consolidated comparison of model performance across datasets and activity types.

Model Dataset Activity Type Accuracy (%) Precision (%) Recall (%) F1 Score (%)
CNNd-TAm (Proposed) UTwente Combined 99.15 100.00 98.65 98.82
CNNd-TAm (Proposed) UTwente Simple 99.75 100.00 99.05 99.52
CNNd-TAm (Proposed) UTwente Complex 99.42 100.00 99.20 99.60
CNNd-TAm (Proposed) UCI-HAR Simple 99.65 99.85 99.30 99.57
CNNd-TAm (Proposed) OPPORTUNITY Complex 98.30 98.80 97.90 98.35
VGG-16 [39] UTwente Combined 96.80 97.50 95.90 96.70
VGG-16 [39] UTwente Simple 97.30 97.90 96.80 97.35
VGG-16 [39] UTwente Complex 95.50 96.20 94.80 95.50
VGG-16 [39] UCI-HAR Simple 97.60 98.20 97.10 97.65
VGG-16 [39] OPPORTUNITY Complex 94.70 95.40 94.00 94.70
ResNet-50 [40] UTwente Combined 97.40 98.10 96.70 97.40
ResNet-50 [40] UTwente Simple 97.90 98.50 97.40 97.95
ResNet-50 [40] UTwente Complex 96.20 96.90 95.60 96.25
ResNet-50 [40] UCI-HAR Simple 98.10 98.70 97.80 98.25
ResNet-50 [40] OPPORTUNITY Complex 95.30 96.00 94.70 95.35
CNN-BiGRU [10] UTwente Combined 98.75 100.00 97.35 98.46
CNN-BiGRU [10] UTwente Simple 99.40 100.00 98.45 99.18
CNN-BiGRU [10] UTwente Complex 98.87 100.00 98.93 99.46
CNN-BiGRU [10] UCI-HAR Simple 99.20 99.60 98.80 99.20
CNN-BiGRU [10] OPPORTUNITY Complex 97.50 98.10 96.90 97.50
CNN-BiLSTM [18] UTwente Combined 97.73 98.05 93.12 95.28
CNN-BiLSTM [18] UTwente Simple 99.82 100.00 99.28 99.62
CNN-BiLSTM [18] UTwente Complex 98.85 100.00 98.88 99.44
CNN-BiLSTM [18] UCI-HAR Simple 99.40 99.70 99.10 99.40
CNN-BiLSTM [18] OPPORTUNITY Complex 96.90 97.50 96.60 97.05
CNN-GRU [13] UTwente Combined 97.92 98.60 93.48 95.68
CNN-GRU [13] UTwente Simple 99.18 100.00 97.60 98.74
CNN-GRU [13] UTwente Complex 97.62 98.05 98.85 98.44
CNN-GRU [13] UCI-HAR Simple 99.00 99.40 98.60 99.00
CNN-GRU [13] OPPORTUNITY Complex 96.60 97.20 96.00 96.60
CNN-LSTM [24] UTwente Combined 97.65 99.70 91.15 94.90
CNN-LSTM [24] UTwente Simple 98.70 99.10 95.40 97.00
CNN-LSTM [24] UTwente Complex 98.48 98.95 98.81 98.88
CNN-LSTM [24] UCI-HAR Simple 98.60 99.20 98.30 98.75
CNN-LSTM [24] OPPORTUNITY Complex 96.20 96.90 95.80 96.35

The CNNd-TAm model surpasses the traditional models in every setting and the HAR-specific baselines in nearly all of them; the one exception in Table 6 is simple activities on UTwente, where CNN-BiLSTM reaches 99.82% accuracy against 99.75% for CNNd-TAm. CNNd-TAm attains accuracies of 98.50–99.15% for combined activities, markedly exceeding VGG-16 (94.70–97.60%) and ResNet-50 (95.30–98.10%). For simple activities (UTwente, UCI-HAR), CNNd-TAm exhibits higher accuracies (99.65–99.75%) than VGG-16 (97.30–97.60%) and ResNet-50 (97.90–98.10%), along with better recall and F1 scores attributable to its architecture being optimized for time-series data. For complex activities (UTwente, OPPORTUNITY), CNNd-TAm’s performance (98.30–99.42%) significantly surpasses that of VGG-16 (94.70–95.50%) and ResNet-50 (95.30–96.20%), due to its Dilated convolutional layers and temporal attention mechanism, which capture the long-term dependencies essential for distinguishing between activities such as drinking and smoking.

The weaker performance of VGG-16 can be attributed to its deep architecture, which is susceptible to overfitting on smaller datasets (e.g., OPPORTUNITY) and lacks dedicated mechanisms for temporal modeling. ResNet-50 outperforms VGG-16 owing to its residual connections; however, it remains less successful than CNNd-TAm, especially in complex activities, where it shows elevated error rates (4–6%) in differentiating similar patterns.

Comparison of accuracy and losses

To assess the efficacy of the proposed CNNd-TAm model relative to other deep learning architectures, we compared training accuracy and loss across several models on the OPPORTUNITY and UTwente datasets for complex activities. The training procedure spanned 200 epochs. Figures 8 and 9 illustrate the training accuracy and loss curves for the OPPORTUNITY and UTwente datasets, respectively.

Fig. 9.

Comparison of (a) training accuracy and (b) training loss for Hybrid Deep learning models on the UTwente dataset for complex activities over 200 epochs.

In the OPPORTUNITY dataset, seen in Fig. 8a, the CNNd-TAm model exhibits exceptional training accuracy, with a final accuracy of 98.30% after 200 epochs, surpassing all other models. The CNN-BiGRU model achieves a final accuracy of 97.50%, whilst CNN-BiLSTM, CNN-LSTM, and CNN-GRU attain final accuracies of 96.90%, 96.20%, and 96.60%, respectively. The accuracy curves demonstrate that CNNd-TAm converges more rapidly and exhibits a steady upward trajectory relative to other models, especially after 50 epochs, during which it maintains a competitive advantage. Figure 8b depicts the training loss trajectories for OPPORTUNITY. CNNd-TAm attains the minimal final loss of 0.022, indicating its strong learning proficiency. CNN-BiGRU and CNN-BiLSTM achieve final losses of 0.035 and 0.038, respectively, whereas CNN-LSTM and CNN-GRU exhibit somewhat elevated losses of 0.045 and 0.048. The loss curves for CNNd-TAm and CNN-BiGRU demonstrate more consistent convergence than those of CNN-LSTM and CNN-GRU, which exhibit intermittent volatility, especially during the initial epochs.

In the UTwente dataset, as illustrated in Fig. 9a, CNNd-TAm attains the highest final accuracy of 99.13%, exceeding CNN-BiGRU (97.73%), CNN-BiLSTM (97.73%), CNN-LSTM (97.65%), and CNN-GRU (97.92%). The accuracy trends resemble those noted in OPPORTUNITY, with CNNd-TAm converging faster and holding a steady advantage after 50 epochs. Figure 9b illustrates the training loss trajectories for UTwente, where CNNd-TAm attains a final loss of 0.0069, lower than CNN-BiGRU (0.0084), CNN-BiLSTM (0.0084), CNN-LSTM (0.0238), and CNN-GRU (0.0089). The loss curves for UTwente show more consistent convergence for CNNd-TAm than for the other models, with fewer fluctuations throughout all epochs.

Statistical evaluation of model efficacy

This subsection provides variance measurements for the performance of CNNd-TAm across tenfold cross-validation on the UTwente, UCI-HAR, and OPPORTUNITY datasets, specifically for accuracy and F1 score, in response to the reviewer’s request for statistical robustness. Table 7 presents the mean, standard deviation (SD), and 95% confidence intervals (CI) for the metrics reported in Tables 3, 4, 5 and 6. UCI-HAR encompasses only simple and combined activities, as it does not include complex tasks such as typing or speaking. OPPORTUNITY encompasses simple, complex, and combined activities, reflecting its varied range of tasks.

Table 7.

Statistical analysis of CNNd-TAm performance across datasets.

Dataset Activity type Metric Mean (%) SD (%) 95% CI
UTwente Combined Accuracy 99.15 0.25 [98.9, 99.4]
UTwente Combined F1 Score 98.82 0.28 [98.5, 99.1]
UTwente Simple Accuracy 99.75 0.15 [99.6, 99.9]
UTwente Simple F1 Score 99.52 0.18 [99.3, 99.7]
UTwente Complex Accuracy 99.42 0.20 [99.2, 99.6]
UTwente Complex F1 Score 99.60 0.22 [99.4, 99.8]
UCI-HAR Combined Accuracy 99.50 0.20 [99.3, 99.7]
UCI-HAR Combined F1 Score 99.45 0.22 [99.2, 99.7]
UCI-HAR Simple Accuracy 99.65 0.18 [99.5, 99.8]
UCI-HAR Simple F1 Score 99.57 0.15 [99.4, 99.7]
OPPORTUNITY Combined Accuracy 98.40 0.27 [98.1, 98.7]
OPPORTUNITY Combined F1 Score 98.50 0.25 [98.2, 98.8]
OPPORTUNITY Simple Accuracy 98.70 0.23 [98.5, 98.9]
OPPORTUNITY Simple F1 Score 98.80 0.18 [98.6, 99.0]
OPPORTUNITY Complex Accuracy 98.30 0.25 [98.1, 98.5]
OPPORTUNITY Complex F1 Score 98.35 0.30 [98.0, 98.7]

The performance of CNNd-TAm was assessed using tenfold cross-validation on the UTwente, UCI-HAR, and OPPORTUNITY datasets, with variance metrics for accuracy and F1 score presented in Table 7. For UTwente, combined activities yield an accuracy of 99.15 ± 0.25% (CI: [98.9, 99.4]) and an F1 score of 98.82 ± 0.28%. The UCI-HAR dataset, restricted to simple and combined activities without complex tasks (e.g., typing, conversing), attains an accuracy of 99.50 ± 0.20% (CI: [99.3, 99.7]) and an F1 score of 99.45 ± 0.22% (CI: [99.2, 99.7]). For combined activities, OPPORTUNITY reaches an accuracy of 98.40 ± 0.27% and an F1 score of 98.50 ± 0.25% (CI: [98.2, 98.8]); Fig. 10 shows these metrics across datasets. Small standard deviations (0.15–0.30%) and narrow confidence intervals indicate stable performance across datasets, supported by the Dilated CNN and dual attention mechanism.
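For readers who wish to reproduce this kind of summary from their own fold-level results, the sketch below computes the mean, the sample standard deviation, and a 95% confidence interval from ten fold scores. It uses a Student-t interval, which is one common choice; the paper does not state how its intervals were computed. The fold accuracies are made-up values, and 2.262 is the two-sided 95% t critical value for 9 degrees of freedom.

```python
import math
import statistics

def summarize_folds(scores, t_crit=2.262):
    """Mean, sample SD, and t-based 95% CI for cross-validation scores."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 in the denominator)
    half_width = t_crit * sd / math.sqrt(len(scores))
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical accuracies (%) from ten folds.
fold_accuracies = [99.0, 99.2, 99.4, 99.1, 99.3, 98.9, 99.5, 99.2, 99.0, 99.4]

mean, sd, ci = summarize_folds(fold_accuracies)
print(round(mean, 2), round(sd, 2), tuple(round(x, 2) for x in ci))
```

The default `t_crit` only applies to ten folds; a different number of folds needs the critical value for the matching degrees of freedom.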

Fig. 10.

Performance metrics for combined activities across datasets.

Generalizability and limitations

The UTwente dataset, consisting of 13 activities from 10 users, offers comprehensive sensor data; nevertheless, it is constrained by limited user variety and a small sample size, which may hinder its capacity to represent real-world heterogeneity in age, gender, or movement styles. The dataset’s activity distribution, although balanced, includes complex behaviors such as smoking or conversational gestures with non-repetitive patterns, which may not fully represent dynamic real-world circumstances. Notwithstanding these limitations, the CNNd-TAm model exhibits strong generalizability, as indicated by low standard deviations (0.15–0.25%) and narrow confidence intervals (e.g., 99.28–99.56% for complex activities) in tenfold cross-validation on UTwente (Tables 3, 4 and 5). The Dilated convolutions and dual attention mechanism handle varied patterns well, reducing errors to less than 0.8% in differentiating similar activities such as drinking and smoking. Performance on UCI-HAR (99.65% accuracy, SD 0.18%) and OPPORTUNITY (98.30% accuracy, SD 0.25%) further supports adaptability to datasets with greater user diversity or more complex behaviors. For WISDM, which resembles UCI-HAR, CNNd-TAm is anticipated to attain similar accuracy (> 98.5%, SD < 0.30%) owing to its multi-scale feature extraction capabilities. The model’s computational efficiency facilitates deployment on wearable devices, improving practical applicability. Future research may enhance generalizability by integrating data from more varied users or by using online learning to adapt to novel activities, thereby supporting consistent performance across a range of real-world contexts.

Conclusion

The CNNd-TAm model employs three complementary components, Dilated convolutions, a BiGRU layer, and parallel attention mechanisms, to jointly process spatial, temporal, and adaptive attention features. The dilated convolutions, with dilation rates of 1, 2, and 4, widen the model’s temporal field of view and allow local and global features to be captured together without raising computational cost. BiGRU processes sequences in both temporal directions, analyzing past and future information together, which enables it to learn both long-term and short-term patterns without adding parameters. This is particularly beneficial for complex activities. The Attention module emphasizes the essential information within the temporal sequence by concentrating on significant frames. By assigning adaptive weights to frames, it lets the model focus on the more informative parts of the data, improving its performance on challenging categories such as drinking, typing, or smoking.

The proposed model demonstrates superior accuracy for classes with closely related temporal patterns or brief, erratic motions. Analysis of the confusion matrix for complex activities indicated that the error rate between classes such as “drinking” and “smoking,” or “typing” and “writing,” is markedly lower for the CNNd-TAm model than for the baseline models. This results from the integration of BiGRU temporal memory with adaptive attention that concentrates on frames with high informational density, enhancing the model’s sensitivity and accuracy in detecting nuanced and complex behaviors.
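A generic form of temporal attention over frames can be sketched as below: each frame receives a score, the scores are normalized with a softmax, and the frame features are summed with those weights. The dot-product scoring against a single query vector and the toy feature values are assumptions for illustration; the modified attention mechanism of CNNd-TAm is not specified in this section and may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(frames, query):
    """Weight each frame by softmax(frame . query) and return the weighted sum."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    context = [sum(w * f[k] for w, f in zip(weights, frames)) for k in range(dim)]
    return weights, context

# Toy sequence of four 2-D frame features (e.g. recurrent-layer outputs);
# the third frame aligns best with the query.
frames = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.0, 0.3]]
query = [1.0, 1.0]       # hypothetical learned vector
weights, context = temporal_attention(frames, query)
print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

The frame with the highest score dominates the pooled representation, which is the sense in which attention lets a classifier focus on the few frames that separate otherwise similar activities.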

The choice of temporal window size according to the characteristics of the activities is a critical factor in the efficacy of the proposed model. Using 20, 30, and 40 s windows for simple, general, and complex activities, respectively, improved the balance between information quantity, noise mitigation, and temporal relevance. The Butterworth filter, Min–Max normalization, linear interpolation for missing data, and 50% overlap in data slicing improved the consistency and quality of the inputs. Combined with tenfold cross-validation, these measures mitigated overfitting and improved the model’s performance on unseen data.
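Three of the preprocessing steps named above (linear interpolation of missing samples, Min–Max normalization, and slicing with 50% overlap) can be sketched with the standard library alone. The 50 Hz sampling rate, the function names, and the choice to normalize the whole signal before slicing are assumptions for illustration; the Butterworth filtering step is omitted because its order and cutoff are not given in this section.

```python
def interpolate_missing(x):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(x) if v is not None]
    if not known:
        raise ValueError("signal has no valid samples")
    out = list(x)
    for i, v in enumerate(x):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = x[right]
        elif right is None:
            out[i] = x[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = x[left] + frac * (x[right] - x[left])
    return out

def min_max(x):
    """Scale values to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]

def sliding_windows(x, window_s, fs_hz, overlap=0.5):
    """Slice x into fixed-length windows with the given fractional overlap."""
    size = int(window_s * fs_hz)
    step = max(1, int(size * (1 - overlap)))
    return [x[s:s + size] for s in range(0, len(x) - size + 1, step)]

FS_HZ = 50                                               # assumption
WINDOW_S = {"simple": 20, "general": 30, "complex": 40}  # window lengths from the text

raw = [float(i % 100) for i in range(100 * FS_HZ)]       # 100 s toy signal
raw[10] = None                                           # simulate a dropped sample
clean = min_max(interpolate_missing(raw))
for kind, seconds in WINDOW_S.items():
    wins = sliding_windows(clean, seconds, FS_HZ)
    print(f"{kind}: {len(wins)} windows of {len(wins[0])} samples")
```

In a real pipeline the minimum and maximum would normally be estimated on the training folds only and then reused on the held-out fold, so that normalization does not leak information across the cross-validation split.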

This model has significant potential for practical applications in health monitoring, sports activity tracking, and intelligent behavior monitoring systems. Nonetheless, its computational complexity and its dependence on data quality are drawbacks that require consideration in future studies. Given these constraints, future work could use multimodal data, such as heart rate or EEG, together with larger datasets that are more diverse in age, gender, and exercise style, thereby improving the model’s generalizability. Transfer learning and reduction of model dimensionality could also make the model suitable for low-power systems. Adaptive time windows, which select the window size so as to preserve temporal dependence, could further improve model performance. These directions can address the existing constraints and broaden the model’s applicability in medicine, athletics, and intelligent interaction.

Acknowledgements

The research was supported by the Natural Science Foundation of Ningxia: Research on Matrix Patching Theory and Algorithm Application Based on Smooth Matrix Decomposition (2023AAC03333).

Author contributions

All authors contributed to the study conception and design. Data collection, simulation and analysis were performed by Zhixue Wang and Kai Kang.

Funding

The authors did not receive any financial support for this study.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. This study used the UTwente dataset, which is publicly available at the following link: https://research.utwente.nl/en/organisations/ut/datasets/

Declarations

Competing interests

The authors declare no competing interests.

Ethical approval

Not applicable.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Chen, K. et al. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Computing Surv. (CSUR) 54(4), 1–40 (2021).
  • 2. Zhang, S. et al. Deep learning in human activity recognition with wearable sensors: A review on advances. Sensors 22(4), 1476 (2022).
  • 3. Serpush, F., Menhaj, M. B., Masoumi, B. & Karasfi, B. Wearable sensor-based human activity recognition in the smart healthcare system. Comput. Intell. Neurosci. 2022(1), 1391906 (2022).
  • 4. Fakhri, P. S. et al. A fuzzy decision-making system for video tracking with multiple objects in non-stationary conditions. Heliyon 9(11), e22156 (2023). [Retracted]
  • 5. Ramanujam, E., Perumal, T. & Padmavathi, S. Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review. IEEE Sens. J. 21(12), 13029–13040 (2021).
  • 6. Liu, X., Li, G., Zhou, X., Liang, X. & Hou, Z. A weight-aware-based multisource unsupervised domain adaptation method for human motion intention recognition. IEEE Trans. Cybern. 55(7), 3131–3143. 10.1109/TCYB.2025.3565754 (2025).
  • 7. Jordao, A., Nazare Jr, A. C., Sena, J., & Schwartz, W. R. (2018). Human activity recognition based on wearable sensor data: A standardization of the state-of-the-art. arXiv preprint arXiv:1806.05226.
  • 8. Shoaib, M., Bosch, S., Incel, O. D., Scholten, H. & Havinga, P. J. Complex human activity recognition using smartphone and wrist-worn motion sensors. Sensors 16(4), 426 (2016).
  • 9. Zhao, Y. et al. Highly sensitive, wearable piezoresistive methylcellulose/chitosan@MXene aerogel sensor array for real-time monitoring of physiological signals of pilots. Sci. China Mater. 68(2), 542–551. 10.1007/s40843-024-3188-4 (2025).
  • 10. Mekruksavanich, S. & Jitpattanakul, A. Deep convolutional neural network with RNNs for complex activity recognition using wrist-worn wearable sensor data. Electronics 10(14), 1685 (2021).
  • 11. Zhang, X. et al. Bioinspired flexible kevlar/hydrogel composites with antipuncture and strain-sensing properties for personal protective equipment. ACS Appl. Mater. Interfaces 16(34), 45473–45486. 10.1021/acsami.4c08659 (2024).
  • 12. Ariza-Colpas, P. P. et al. Human activity recognition data analysis: History, evolutions, and new trends. Sensors 22(9), 3401 (2022).
  • 13. Gupta, S. Deep learning based human activity recognition (HAR) using wearable sensor data. Int. J. Inf. Manag. Data Insights 1(2), 100046 (2021).
  • 14. Qi, H. et al. Electrospun green fluorescent-highly anisotropic conductive Janus-type nanoribbon hydrogel array film for multiple stimulus response sensors. Compos. Part B: Eng. 288, 111933. 10.1016/j.compositesb.2024.111933 (2025).
  • 15. Khan, I. U., Afzal, S. & Lee, J. W. Human activity recognition via hybrid deep learning based model. Sensors 22(1), 323 (2022).
  • 16. Deep, S., & Zheng, X. (2019). Hybrid model featuring CNN and LSTM architecture for human activity recognition on smartphone sensor data. In 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT) (pp. 259–264).
  • 17. Wang, H. et al. Wearable sensor-based human activity recognition using hybrid deep learning techniques. Secur. Commun. Netw. 2020(1), 2132138 (2020).
  • 18. Luwe, Y. J., Lee, C. P. & Lim, K. M. Wearable sensor-based human activity recognition with hybrid deep learning model. Informatics 9(3), 56 (2022).
  • 19. Thu, N. T. H. & Han, D. S. HiHAR: A hierarchical hybrid deep learning architecture for wearable sensor-based human activity recognition. IEEE Access 9, 145271–145281 (2021).
  • 20. Abbaspour, S. et al. A comparative analysis of hybrid deep learning models for human activity recognition. Sensors 20(19), 5707 (2020).
  • 21. Xia, K., Huang, J. & Wang, H. LSTM-CNN architecture for human activity recognition. IEEE Access 8, 56855–56866 (2020).
  • 22. Challa, S. K., Kumar, A. & Semwal, V. B. A multibranch CNN-BiLSTM model for human activity recognition using wearable sensor data. Vis. Comput. 38(12), 4095–4109 (2022).
  • 23. Dirgová Luptáková, I., Kubovčík, M. & Pospíchal, J. Wearable sensor-based human activity recognition with transformer model. Sensors 22(5), 1911 (2022).
  • 24. Khatun, M. A. et al. Deep CNN-LSTM with self-attention model for human activity recognition using wearable sensor. IEEE J. Transl. Eng. Health Med. 10, 1–16 (2022).
  • 25. Athota, R. K. & Sumathi, D. Human activity recognition based on hybrid learning algorithm for wearable sensor data. Int. J. Inf. Technol. 14(7), 3539–3548 (2022).
  • 26. Dua, N., Singh, S. N., Challa, S. K., Semwal, V. B., & Sai Kumar, M. L. S. (2022, December). A survey on human activity recognition using deep learning techniques and wearable sensor data. In International Conference on Machine Learning, Image Processing, Network Security and Data Sciences (pp. 52–71). Cham: Springer Nature Switzerland.
  • 27. Qin, Z., Zhang, Y., Meng, S., Qin, Z. & Choo, K. K. R. Imaging and fusing time series for wearable sensor-based human activity recognition. Inf. Fusion 53, 80–87 (2020).
  • 28. Khan, D. et al. Robust human locomotion and localization activity recognition over multisensory. Front. Physiol. 15, 1344887. 10.3389/fphys.2024.1344887 (2024).
  • 29. Thakur, D. & Biswas, S. Feature fusion using deep learning for smartphone based human activity recognition. Int. J. Inf. Technol. 13(4), 1615–1624 (2021).
  • 30. Zhang, W., Xu, N., Zhao, N. & Al-Barakati, A. A. Adaptive neural finite-time self-triggered control for nonstrict-feedback nonlinear systems with sensor faults. Robotic Intelligence and Automation (2025).
  • 31. Wu, X., Ding, S., Zhao, N., Wang, H. & Niu, B. Neural-network-based event-triggered adaptive secure fault-tolerant containment control for nonlinear multi-agent systems under denial-of-service attacks. Neural Networks.
  • 32. Liu, X., Song, L., Liu, S. & Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 13(3), 1224 (2022).
  • 33. Roodschild, M., Gotay-Sardiñas, J., Jimenez, V. A., & Will, A. (2024). Zorro: A Flexible and Differentiable Parametric Family of Activation Functions That Extends ReLU and GELU. arXiv preprint arXiv:2409.19239.
  • 34. Katrompas, A., & Metsis, V. (2024, June). Many-to-Many Prediction for Effective Modeling of Frequent Label Transitions in Time Series. In Proceedings of the 17th International Conference on PErvasive Technologies Related to Assistive Environments (pp. 265–272).
  • 35. Wang, Y., Li, M., Liu, J., Leng, Z., Li, F. W. B., Zhang, Z., ... Liang, X. (2025). Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation. International Journal of Computer Vision, 133(7), 4277–4293.
  • 36. Thakur, D., Guzzo, A. & Fortino, G. Attention-based multihead deep learning framework for online activity monitoring with smartwatch sensors. IEEE Internet Things J. 10(20), 17746–17754 (2023).
  • 37. Thakur, D., Guzzo, A. & Fortino, G. Intelligent adaptive real-time monitoring and recognition system for human activities. IEEE Trans. Ind. Inf. 20(11), 13212 (2024).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
