Sensors (Basel, Switzerland). 2023 Feb 17;23(4):2278. doi: 10.3390/s23042278

A Deep Learning-Based Semantic Segmentation Model Using MCNN and Attention Layer for Human Activity Recognition

Sang-hyub Lee 1, Deok-Won Lee 1, Mun Sang Kim 1,*
Editor: Antonio Fernández-Caballero
PMCID: PMC9965081  PMID: 36850876

Abstract

With the development of wearable devices such as smartwatches, several studies have been conducted on the recognition of various human activities. Various types of data are used, e.g., acceleration data collected using an inertial measurement unit sensor. Most scholars segment the entire timeseries data with a fixed window size before performing recognition. However, this approach has limitations in performance because the execution time of a human activity is usually unknown. Therefore, there have been many attempts to solve this problem by sliding the classification window (SW) along the time axis. In this study, we propose a method for classifying every frame (CEF) rather than a window-based recognition method. For implementation, features extracted using multiple convolutional neural networks (MCNN) with different kernel sizes were fused. In addition, similar to the convolutional block attention module (CBAM), an attention layer is applied at the channel and spatial levels to improve the recognition performance of the model. To verify the performance of the proposed model and the effectiveness of the proposed method for human activity recognition, evaluation experiments were performed. For comparison, we used models built from various basic deep learning modules and models that classify all frames to recognize specific waves in electrocardiography data. As a result, the proposed model reported the best F1-score (over 0.9) for all target activities compared with the other deep learning-based recognition models. Further, to verify the improvement offered by the proposed CEF method, it was compared with three types of SW method. As a result, the proposed method reported an F1-score 0.154 higher than that of SW. In the case of the designed model, the F1-score was higher by as much as 0.184.

Keywords: human activity recognition, transitional activities, deep learning, accelerometer sensor, attention layer, semantic segmentation

1. Introduction

1.1. Research Background

Various issues about the safety and health of the elderly have emerged in our aging society. Studies are being conducted to prevent these issues. Particularly, the awareness of daily activity is becoming more important because it is directly related to the health of the elderly. Owing to an aging society, the elderly population is increasing, but there is a limit to the manpower that can take care of them; thus, a technology that can replace the elderly care manpower is required. For this reason, with the recent development of wearable devices and deep learning (DL)-based artificial intelligence technology, human activity recognition (HAR) is being employed to recognize what people are doing through a series of data over time.

HAR is a technology suitable for the current healthcare field in an aging society. This is because data on perceived human activity can be used in various technological fields, such as human–computer interaction and human–robot interaction (HRI) [1]. By fusion with the internet of things (IoT) technology or timeseries sensor data, it is possible to propose appropriate services for various targets. For example, a mobile robot that can operate in an indoor environment can provide appropriate and proactive new services such as medication recognition for the elderly considering recognized activities. In addition, HAR can generate significant information for implementing a home care system. It is crucial to quickly and accurately recognize issues directly related to diseases or health, such as falls, in the time domain. In this sense, the HAR technology can be of great significance for distributing a monitoring system in a real environment [2].

There are three typical types of data used for HAR [3]. The first one is biosignal data such as electroencephalography, electromyography, and electrocardiography (ECG). Such data cannot be collected easily because the data collection requires specific equipment, including an electrode for recording electrical signals. The second one is behavior-sensing data such as the type of image obtained from an RGB-D sensor. They provide lots of useful information for HAR in the form of the original image and the skeleton type extracted from the depth image. However, because there are issues such as privacy invasion, it is unsuitable for application to people’s home environments. In addition, an RGB-D sensor has a coverage limitation in that the target must be located in the field of view of the sensor during data collection, and there should be no occlusions that compromise the data quality. These issues significantly influence the recognition of the behavior of many objects. Therefore, it is unsuitable for an elderly home care or monitoring system in daily life. The last type is activity-sensing data from an inertial measurement unit (IMU) comprising an accelerometer and a gyroscope. They have high usability according to the development of wearable devices, and the privacy issue is less severe. Moreover, because data can be collected from the sensor itself or using specific anchors with signal communication, coverage limitations are less impactful for the application than RGB-D sensors. Most IMU data are obtained from a wearable sensor attached to the body and have high scalability because they can easily be fused with other sensor data that have timeseries characteristics. For example, if IMU data are combined with indoor localization technology such as ultra-wide-band (UWB) sensing, not only the information necessary for behavior recognition but also context information such as the target position can be obtained. 
Such sensor fusion improves the performance of behavior recognition and can be the key to HRI or IoT technology. For that reason, IMU data are the most appropriate data type for HAR in daily life [4]. Therefore, a new HAR method using IMU data, particularly acceleration data, is proposed in this study.

With the development of DL, many scholars have performed HAR using timeseries acceleration data collected with wearable sensors. An elaborate HAR can be achieved by detecting the start and end points of the target activity in timeseries data that include single or multiple activities. Most scholars first segmented the entire timeseries data into windows of the optimal size for classifying target activities [5] and then classified each segment. However, this approach has limitations in performance because human activities are not standardized for each person [6]. In detail, the fixed-size window (FSW) method, shown on the left of Figure 1, cannot properly cover the target activity when its execution time is longer than the window size. In addition, when a single window contains multiple activities, the classification accuracy is low.

Figure 1. (Left) fixed-size window and (Right) sliding window for activity recognition.

To tackle the above issue, the sliding window (SW) method, shown on the right of Figure 1, has been used recently. The classification window moves along the time axis considering the size of the overlapping area, and classification is performed for every step. However, as with the FSW method, the SW method still has an issue with determining the optimal window size and overlapping area. In the results of several studies, different optimal sizes have been reported for different datasets. In addition, there remains a generalization problem for the recognition performance of the obtained optimal size. In particular, because the duration of human activities is not constant, there are limitations in accurately classifying behavior even using SW.
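The two windowing schemes in Figure 1 can be illustrated with a minimal Python sketch. The function name `sliding_windows`, the 250-frame recording, and the 50-frame window are illustrative assumptions, not values taken from the cited studies.

```python
def sliding_windows(num_frames, window_size, overlap):
    """Return (start, end) frame indices of each classification window.

    `overlap` is the fraction of a window shared with the next one;
    0.0 reproduces the fixed-size, non-overlapping (FSW) case.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # The step between window starts shrinks as the overlap grows.
    step = max(1, round(window_size * (1.0 - overlap)))
    return [(start, start + window_size)
            for start in range(0, num_frames - window_size + 1, step)]


# Hypothetical recording of 250 frames, windows of 50 frames.
fsw = sliding_windows(250, 50, overlap=0.0)  # fixed-size windows
sw = sliding_windows(250, 50, overlap=0.5)   # sliding windows, 50% overlap
print(len(fsw), len(sw))
```

Both parameters must be chosen before training, which is the source of the generalization problem described above.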

To solve the above problems, a method of classification for every frame (CEF) in timeseries acceleration data is proposed in this study. The proposed method is similar to semantic segmentation, a segmentation method from the field of two-dimensional (2D) image recognition. In addition, a new DL-based architecture is designed for conducting semantic segmentation on three-axis acceleration data.

1.2. Related Work

Many scholars have conducted recognition-based model development studies for HAR using convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Particularly, most scholars have developed a HAR model using CNN. In [7], a one-dimensional CNN (1D-CNN)-based model for HAR was proposed. For the input, the acceleration data from multiple IMU sensors attached at different positions of body parts were used. The proposed model applies multiple CNN (MCNN) and pooling layers to each sensor data separately. In addition, extracted features are concatenated and used to predict the segment class. The proposed classification model predicts which behavior each segment corresponds to. For the evaluation, three public datasets of human activity presented in [8,9,10] were used. The window sizes were 0.72, 3, and 1 s for each dataset, and the size of the overlapping area of SW was set to 50%, 78%, and 99% of the segmented window sizes, respectively. As a result, the accuracy was 92.22%, 93.68%, and 70.80%, respectively. The authors of [11] proposed a 1D-CNN-based model for recognition. The proposed model extracts meaningful features by capturing local dependencies and scale invariance of timeseries activity data acquired by IMU. Similar to [7], the recognition model included a channel-wise 1D-CNN layer (applying 1D-CNN layers to x, y, and z channels) and a pooling layer. Moreover, the window size and the size of the overlapping area were 64 and 50%, respectively. For the evaluation, the public human activity datasets using IMU presented in [8,12,13] were used. The accuracy for each dataset was 76.83%, 88.19%, and 96.88%. The authors of [14] proposed a HAR model using 1D-CNN and a divide-and-conquer-based classifier. First, the proposed model recognized activity as static (sit, stand, and lay) or dynamic (walking, walking upstairs, and walking downstairs) activity using binary classification. Then, two three-class classifiers were implemented to predict the class of each FSW. 
Finally, test data sharpening was adopted to improve the HAR performance. The window size and size of the overlapping area were 500 and 250 ms, respectively. The proposed model was evaluated on two public datasets presented in [8,15]. As a result, the accuracy for each dataset was 94.2% and 97.62%. In [16], multiple DL architectures, including deep feed-forward neural network, CNN, long short-term memory (LSTM), and bidirectional LSTM, were implemented for HAR. The recognition models were evaluated on three public datasets [8,9,17] using different window sizes (1, 5.12, and 1 s) and the size of the overlapping area (50%, 78%, and 50%). As a result, the bidirectional LSTM showed the best performance on the three datasets, with F1-scores of 0.929 for [8], 0.745 for [9], and 0.76 for [17]. The authors of [18] used stacked LSTM modules for HAR. Further, the SW method was implemented with a window size of 10 s, and 90% of the window size was set as the size of the overlapping area. As a result, an accuracy of 94% was achieved for the public dataset presented in [19]. The authors of [20] proposed a HAR model using bidirectional LSTM modules with a residual connection. For the classifier implementation, the input data have FSW with a 2.56-s segment. In addition, the size of the overlapping area of SW was 50% of the window size. Notably, the residual connection made the model optimization much easier than the original structure because the gradient values used in the learning process could be spread to the layers more directly through the residual connection. As a result, the F1-score for two public datasets presented in [8,15] was 0.905 and 0.935. The key characteristics of introduced previous works are shown in Table 1.

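Table 1. Key characteristics of previous works.

| Author | Dataset | Window Size (s) | SW Overlap (%) | Accuracy (%) | Model |
|---|---|---|---|---|---|
| [7] | [8] | 0.7 | 50 | 92.22 | 1D-CNN |
| [7] | [9] | 3 | 78 | 93.68 | 1D-CNN |
| [7] | [10] | 1 | 99 | 70.80 | 1D-CNN |
| [11] | [8] | 64 | 50 | 76.83 | 1D-CNN |
| [11] | [12] | 64 | 50 | 88.19 | 1D-CNN |
| [11] | [13] | 64 | 50 | 96.88 | 1D-CNN |
| [14] | [8] | 500 ms | 50 | 94.2 | 1D-CNN |
| [14] | [15] | 500 ms | 50 | 97.62 | 1D-CNN |
| [16] | [8] | 1 | 50 | 0.929 (F1-Score) | CNN + LSTM |
| [16] | [9] | 5.12 | 78 | 0.745 (F1-Score) | CNN + LSTM |
| [16] | [17] | 1 | 50 | 0.76 (F1-Score) | CNN + LSTM |
| [18] | [19] | 10 | 90 | 94 | LSTM |
| [20] | [8] | 2.56 | 50 | 0.905 (F1-Score) | LSTM |
| [20] | [15] | 2.56 | 50 | 0.935 (F1-Score) | LSTM |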

Owing to the aforementioned studies, the performance of HAR using SW has been improved by applying DL technology. However, some issues remain. First, the size of the fixed window differs for each proposed model. Even when the same model is used, the optimal size of the fixed window and the size of the overlapping area of SW differ for different datasets. This means that SW with a fixed size is difficult to generalize. In other words, when the SW method is applied to data collected in different environments, the error rate could increase. Second, as mentioned in Section 1, the duration of human activity is usually variable. Therefore, the performance of a proposed HAR model could vary according to the size of the window used as input for the recognition model. Similarly, the size of the overlapping area of SW also affects the performance. The authors of [19] found that window size is a key parameter for improving recognition accuracy. A window that is too small may not include the entire activity, and a window that is too large may cause classification errors. In [6], the window size of SW significantly influenced the recognition performance. In addition, the authors mentioned that the optimal window size is hard to predefine because the type and duration of human activity are not constant. Further, a predefined optimal window size could differ for various unseen activities. Therefore, it is difficult for the SW method to handle diverse activities. Finally, when training SW-based models, the issue of determining the label of each segment remains. Most SW studies set the label of each segment to the class that occupies the largest part of the segment or to the class of the last frame of the segment to improve the training performance. This means that more than one activity could exist in a single window, and the proportion occupied by each class may be biased. 
This can degrade the performance of a recognition model on real data and prevent accurate classification. To tackle these issues, the CEF-based semantic segmentation method is proposed in this study.
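The segment-labeling issue can be made concrete with a small Python sketch. The helper `window_label` and the example window (30 frames of sitting followed by 20 frames of laying) are hypothetical and only illustrate the two labeling rules mentioned above.

```python
from collections import Counter


def window_label(frame_labels, rule="majority"):
    """Collapse the per-frame labels of one window into a single label."""
    if rule == "majority":
        # Class that occupies the largest part of the window.
        return Counter(frame_labels).most_common(1)[0][0]
    if rule == "last":
        # Class of the last frame in the window.
        return frame_labels[-1]
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical 50-frame window that contains two activities.
window = ["sitting"] * 30 + ["laying"] * 20

majority = window_label(window, "majority")
last = window_label(window, "last")

# Fraction of frames whose true class differs from the window label.
mislabeled = sum(lbl != majority for lbl in window) / len(window)
print(majority, last, mislabeled)
```

Whichever rule is used, a window that spans two activities carries frames whose true class differs from its label; classifying every frame avoids this collapse.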

The remainder of this article is structured as follows. Section 2 describes a DL model made up of stacking MCNN and the designed attention layer. In addition, a new dataset comprising only types of transition activities is described in Section 2. Section 3 describes the evaluation results of the proposed method compared with the basic DL models and previous work that conducted semantic segmentation of ECG data. Further, the performance of CEF was evaluated by comparing it with the existing SW method. Section 4 analyzes and discusses the experimental results. Finally, the conclusion and future work are presented in Section 5.

2. Materials and Methods

2.1. CEF Using DL Model

To implement CEF, the method of semantic segmentation was adopted. Semantic segmentation was proposed for 2D image segmentation; it usually means detecting the pixels corresponding to the target in an input image. Similar to image segmentation, the semantic segmentation method was adopted for timeseries data in this study. The designed recognition model uses MCNN as the feature extraction layer for timeseries data. In addition, an attention layer similar to the convolutional block attention module (CBAM) was designed to improve recognition performance.

2.1.1. Feature Extraction Block Using MCNN

The timeseries data comprises different features along the time axis. In the case of data collected using an IMU sensor, the features correspond to the x, y, and z axes. For the feature extraction of timeseries data, 1D-CNN is appropriate because the kernel only moves along the time axis. In addition, the convolutional kernel extracts the features using data of a certain size in the time range. Therefore, 1D-CNN can operate as a local feature extractor in timeseries data. However, the definition of optimal kernel size for improving the recognition performance could not be specified, similar to the problem of SW based on a fixed window. Consequently, multiple 1D-CNNs with different kernel sizes were adopted. Multiple features that consider multiple receptive fields with various ranges can be extracted using MCNN. The designed architecture in this study was inspired by the SPP-Net proposed for image classification [21]. The proposed feature extraction layer uses five 1D-CNN layers with different kernel sizes of 5, 10, 20, 50, and 100. Each layer controls the padding size to make the size of the output data to be the same as that of the input. Each kernel performs a convolution operation on the input data and extracts features by sliding along the time axis. In other words, the features of each time point are extracted considering the surrounding data in different ranges. Then, the extracted features are fed into two 1D-CNNs with kernel size 1. Afterward, different features from each 1D-CNN are concatenated along the channel axis. Finally, features are fed into a single 1D-CNN layer with kernel size 1. The last layer not only adjusts the size of features, but also performs the fusion considering meaningful features among features extracted from each kernel with a different size. The detailed shape of the feature extraction block is shown in Figure 2. 
After every 1D-CNN layer, a batch normalization function is added to prevent overfitting; a layer normalization function is additionally added to prevent the feature values from becoming too large. The proposed feature extraction block is stacked in several steps to make our model deep.
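The feature extraction block described above could be sketched in PyTorch roughly as follows. This is a minimal sketch under several assumptions: the class name `MCNNBlock` and helper `conv_bn` are hypothetical, ReLU is assumed as the activation, `padding="same"` is assumed as the way to keep the output length equal to the input length, and the additional layer normalization is omitted because its exact placement is not specified in the text.

```python
import torch
from torch import nn


def conv_bn(in_ch, out_ch, kernel_size):
    """1D convolution that preserves the number of frames, then batch norm."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding="same"),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),  # assumption: the activation is not stated for this block
    )


class MCNNBlock(nn.Module):
    """Parallel 1D-CNN branches with different kernel sizes, fused by a 1x1 conv."""

    def __init__(self, in_ch, filters=64, kernel_sizes=(5, 10, 20, 50, 100)):
        super().__init__()
        # One branch per kernel size: wide conv followed by two kernel-size-1 convs.
        self.branches = nn.ModuleList([
            nn.Sequential(
                conv_bn(in_ch, filters, k),
                conv_bn(filters, filters, 1),
                conv_bn(filters, filters, 1),
            )
            for k in kernel_sizes
        ])
        # Kernel-size-1 conv that fuses the concatenated branch features.
        self.fuse = conv_bn(filters * len(kernel_sizes), filters, 1)

    def forward(self, x):
        # x has shape (batch, channels, frames).
        features = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(features, dim=1))


# Two hypothetical samples of 3-axis acceleration, 250 frames each.
x = torch.randn(2, 3, 250)
block = MCNNBlock(in_ch=3).eval()
with torch.no_grad():
    y = block(x)
print(y.shape)
```

Because every convolution preserves the frame count, such blocks can be stacked without changing the time axis, which per-frame classification requires.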

Figure 2. Feature extraction block using MCNN.

2.1.2. Implementation of the Attention Layer

To improve recognition performance, an attention layer similar to CBAM presented in [22] was implemented in the designed model. In addition, the attention layer integrates multiple features by weighting for each channel and each time step. Further, it can guide the feature extraction block to extract important features as well. The designed attention layer was based on CBAM reported in image recognition; it performs the attention mechanisms for channel and spatial separately. For the channel attention layer of CBAM, an attention score indicating which input channel is more important is generated with a probability distribution. To obtain the channel-level attention map used for calculating the attention score, average pooling and max pooling are applied to the input data in the spatial direction. Then, the attention score is calculated using a sigmoid function after a feed-forward network. Similarly, in the case of spatial attention of CBAM, an attention score is calculated using a 1D-CNN on the compressed result using average pooling and max pooling in the channel direction. In this study, the attention layer similar to the described CBAM model was applied to timeseries data. As an output, a new highlighted important feature could be acquired.

The channel attention layer of the proposed model is the same as that in CBAM, as shown in the upper part of Figure 3. The input of the attention layer is the input data compressed by applying average pooling and max pooling along the time axis. Then, the attention maps are generated by passing the two types of inputs (average and max) through the same two 1D-CNN layers with a rectified linear unit (ReLU) activation function. The number of filters in the first 1D-CNN of the channel attention layer is 1/16 of the number of input channels. The number of filters in the second 1D-CNN is restored to the number of input channels. This is the same as the channel attention layer described in the original CBAM, which increases the generalization performance of the model, as explained in [22]. After the attention map generation, the two attention maps are added element by element. Finally, the attention score is obtained through a sigmoid function, which maps the values to the range of 0–1. Then, the score is multiplied by the input data along the channel axis. As a result, more important channels that better represent the data are emphasized.
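A minimal PyTorch sketch of this channel attention step is given below. The class name `ChannelAttention` is hypothetical, and placing the ReLU between the two shared kernel-size-1 convolutions follows the original CBAM design rather than an explicit statement in the text.

```python
import torch
from torch import nn


class ChannelAttention(nn.Module):
    """CBAM-style channel attention for (batch, channels, frames) tensors."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(1, channels // reduction)  # 1/16 of the input channels
        # The same two layers are shared by the average- and max-pooled inputs.
        self.shared = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        # Compress the time axis with average pooling and max pooling.
        avg = x.mean(dim=2, keepdim=True)
        mx = x.amax(dim=2, keepdim=True)
        # Add the two attention maps and squash them into the range 0-1.
        score = torch.sigmoid(self.shared(avg) + self.shared(mx))
        # Weight every channel by its score (broadcast over frames).
        return x * score


torch.manual_seed(0)
x = torch.randn(2, 64, 250)
with torch.no_grad():
    out = ChannelAttention(64)(x)
print(out.shape)
```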

Figure 3. Designed CBAM: (top) channel level and (bottom) spatial level.

For attention at the spatial level, a self-attention layer is adopted. The original spatial attention layer presented in [22] is inappropriate for semantic segmentation in timeseries data because spatial information is greatly lost when average and max pooling are applied on the time axis. Therefore, the dot-product self-attention layer presented in [23], as described in the lower part of Figure 3, was used. There are three specific features—query, key, and value—that represent the input data differently. All features are generated through different 1D-CNN layers with the same input data; at this time, the same size as the number of frames of input data is maintained in the output. Then, the attention map is achieved by multiplying the query and key. The attention map of the original self-attention is a relation of each position. In timeseries, the relation of each time step is represented in an attention map. Then, the attention score is achieved using a sigmoid activation function similar to the channel attention layer. Finally, the output of the layer was derived by multiplying the attention score and the value representing the input data. This means that the features of each time step of the output are emphasized considering the entire data. This does not lose positional (time-domain) information and allows the model to be trained to recognize every frame without specifying the input data size. In other words, when calculating the features of a specific frame, the network can be trained to emphasize the features in positions important for classifying. Therefore, it is more suitable for performing semantic segmentation than the spatial attention layer in CBAM.
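The time-step attention described above might look like the following PyTorch sketch. The class name `TemporalSelfAttention` is hypothetical, kernel size 1 is assumed for the query, key, and value convolutions, and no scaling of the dot product is applied because the text does not mention one.

```python
import torch
from torch import nn


class TemporalSelfAttention(nn.Module):
    """Dot-product self-attention over frames with a sigmoid attention score."""

    def __init__(self, channels):
        super().__init__()
        # Three different 1D-CNN layers produce query, key, and value.
        self.query = nn.Conv1d(channels, channels, kernel_size=1)
        self.key = nn.Conv1d(channels, channels, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x has shape (batch, channels, frames); the frame count is preserved.
        q, k, v = self.query(x), self.key(x), self.value(x)
        # attn[b, t, s] relates frame t to frame s: shape (batch, frames, frames).
        attn = torch.sigmoid(torch.bmm(q.transpose(1, 2), k))
        # Each output frame is a score-weighted sum of the value frames.
        return torch.bmm(v, attn.transpose(1, 2))


layer = TemporalSelfAttention(64)
with torch.no_grad():
    out_250 = layer(torch.randn(2, 64, 250))
    out_100 = layer(torch.randn(2, 64, 100))  # a different input length also works
print(out_250.shape, out_100.shape)
```

Nothing in the layer depends on the number of frames, which matches the remark that the input data size does not need to be specified.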

2.1.3. Semantic Segmentation and Loss Function

The proposed model is designed by stacking three structures comprising a feature extraction layer and an attention layer. In addition, the output of each layer maintains the size of data for the time axis the same as the initial input data. Thus, if necessary, zero-padding is applied to the input data. As mentioned above, semantic segmentation in a 2D image means classifying every pixel in the image. In this study, to apply this approach to timeseries, the feature size of the final output of the model was matched with the size of the target classes to be predicted. For implementation, the output was fed into a fully connected layer with the same filter size as the target classes. Finally, the features corresponding to each frame pass through the softmax function to generate a probability distribution and are encoded with the value of the position with the maximum probability. To train the proposed DL model, the cross-entropy loss, a loss function mainly used in classification problems, is applied in every frame. The losses generated in each frame are summed up as the final loss value of training the proposed model, as described in Formula (1). In the actual training phase, the number of filters used in every block and layer was 64.
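The final per-frame decoding step (softmax followed by taking the position of the maximum probability) can be sketched in plain Python. The function names and the three-class scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn the class scores of one frame into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_frames(frame_scores):
    """Return, for every frame, the index of the most probable class."""
    labels = []
    for scores in frame_scores:
        probs = softmax(scores)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


# Hypothetical output of the fully connected layer: 3 frames, 3 classes.
scores = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.2, 3.0]]
labels = decode_frames(scores)
print(labels)
```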

$$\text{Total Cross Entropy Loss} = -\sum_{j=1}^{T}\sum_{i=1}^{C} y_{ji}\log z_{ji}, \quad (1)$$

where $T$ denotes the length of the time axis of the input data, $C$ denotes the number of classes, $y_{ji}$ denotes the true label for class $i$ at time $j$, and $z_{ji}$ denotes the probability output by the softmax function of the recognition model for class $i$ at time $j$.
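As a worked example of Formula (1), the following Python sketch sums the cross-entropy over all frames, assuming one-hot true labels so that only the probability of the true class contributes at each frame. The function name and the two-frame example are hypothetical.

```python
import math


def total_cross_entropy(probs, labels):
    """Sum of per-frame cross-entropy losses.

    probs[j][i] is the softmax probability of class i at frame j;
    labels[j] is the index of the true class at frame j (one-hot label).
    """
    return -sum(math.log(frame[label]) for frame, label in zip(probs, labels))


# Hypothetical two-frame, three-class example.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]

loss = total_cross_entropy(probs, labels)  # -(ln 0.7 + ln 0.8)
print(loss)
```

In PyTorch, a comparable quantity is typically obtained with `torch.nn.functional.cross_entropy(..., reduction="sum")` applied to the unnormalized per-frame scores.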

2.2. Dataset Construction

There are various human activity datasets comprising acceleration data, such as WISDM, UCI HAR, and MHEALTH [24], but most of them focus on the change in the human state, not on the transition activity. The human state is changed by transition activity and can mainly refer to human postures. For example, after a transition activity of sitting, the human state becomes seated. However, to precisely recognize the target activity, it is important to recognize the transition activity in which the target state is transformed. In addition, if the transition activity can be recognized with high accuracy, the human state can be predicted easily. Nevertheless, most public datasets are labeled the same for the transition activity and subsequent target state. In other words, there is no distinguished label between the state and transition activity. Therefore, a new dataset comprising only types of transition activity was constructed in this study.

As previously mentioned, public datasets have various issues for implementing and evaluating semantic segmentation of human activity data. Therefore, a new dataset was constructed using a watch-type IMU. The target activities comprise get-up, laying, stand-up, picking, sitting, and walking. Further, a background class, meaning no movement, is included. The target classes comprise behaviors that can occur in daily life and are commonly included in many public datasets. Data comprising two consecutive activities are also included in the dataset. These two-activity data cover all combinations of the target classes that humans could perform. The sensors used to construct the dataset consisted of a UWB sensor and an IMU with a 3-axis (x, y, z) accelerometer (LIS2DS12TR, STMicroelectronics), as shown in Figure 4. The UWB signal, which provides the indoor location of the sensor, was not used in this study. However, it will be used in future studies that use context information to improve recognition performance. The acceleration data were captured at 15 frames per second; that is, the movement of the subject was sampled approximately every 66.7 ms. All subjects wore the provided sensors on their right wrists and performed the motion for 250 frames. This means the size of a single sample (the input data) was 250 frames. Therefore, every sample collected from the subjects, whether a single action or a combination of two actions, has a size of 250 frames. Subjects were 8 males between the ages of 20 and 40 years. In addition, each subject performed 6 single actions and 12 multi-actions, 10 times each, for a total of 180 recordings per subject. All data were labeled with the corresponding activity by pinpointing the starting and ending points. The labeling procedure was performed manually by one person who watched all movements of all subjects. 
In addition, activities were performed at various time points in the single data. The state and target activities of the dataset are described in Table 2.

Figure 4. IMU sensor.

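Table 2. Status of constructed dataset.

| Activity / ID | a | b | c | d | e | f | g | h |
|---|---|---|---|---|---|---|---|---|
| get-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| walking → picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| walking → sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| stand-up → walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| sitting → laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| get-up → stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| picking → walking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| get-up → laying | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| laying → get-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| stand-up → picking | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| stand-up → sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| picking → sitting | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| sitting → stand-up | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |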
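The dataset figures quoted in this section can be cross-checked with a few lines of Python. The constant names are illustrative; the values are taken from the text and Table 2.

```python
SAMPLING_RATE_HZ = 15      # acceleration frames captured per second
FRAMES_PER_SAMPLE = 250    # length of every recording
SUBJECTS = 8               # subject IDs a to h
ACTIVITY_TYPES = 6 + 12    # single actions plus two-action combinations
REPETITIONS = 10           # repetitions per activity type and subject

frame_period_ms = 1000 / SAMPLING_RATE_HZ                 # about 66.7 ms
sample_duration_s = FRAMES_PER_SAMPLE / SAMPLING_RATE_HZ  # about 16.7 s
recordings_per_subject = ACTIVITY_TYPES * REPETITIONS
total_recordings = SUBJECTS * recordings_per_subject
total_frames = total_recordings * FRAMES_PER_SAMPLE

print(round(frame_period_ms, 1), round(sample_duration_s, 1))
print(recordings_per_subject, total_recordings, total_frames)
```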

3. Results

Two experiments were performed to evaluate the proposed model on the new dataset. First, we evaluated the performance improvement compared with the basic DL modules and models proposed in the ECG segmentation studies. Second, we experimented to evaluate how the method of CEF proposed in this study is more accurate than SW. The evaluation metrics for all verification were the F1-score, precision, and recall:

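$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \quad (4)$$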

where TP (true positive) is the number of frames that are predicted as a given class and actually belong to it, FP (false positive) is the number of frames that are predicted as that class but do not actually belong to it, and FN (false negative) is the number of frames that actually belong to that class but are predicted as another class.
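A frame-level computation of Formulas (2)–(4) for a single class can be sketched in Python as follows. The function name and the six-frame example are hypothetical.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Frame-level precision, recall, and F1-score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # cls frames that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical six-frame sequence with a background ("bg") and a "sit" class.
y_true = ["bg", "bg", "sit", "sit", "sit", "bg"]
y_pred = ["bg", "sit", "sit", "sit", "bg", "bg"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "sit")
print(precision, recall, f1)
```

Here two frames are true positives, one background frame is a false positive, and one sitting frame is a false negative, so all three metrics equal 2/3.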

In addition, experiments were performed on an Intel i9-11900F octa-core processor clocked at 2.50 GHz with 32 GB RAM. An RTX 3070 GPU was used to run the proposed DL model and all comparison models. The models were developed and implemented using PyTorch version 1.10.2 and Python version 3.7. The size of the designed model was 13.063 MB with 3,416,583 parameters. All models were trained using leave-one-subject-out cross-validation, with the number of epochs fixed at 200 for each training run. The learning rate and batch size were set to 0.001 and 100 in all experiments, and the Adam method was employed for optimization. The experimental results are described below, and the analysis is performed in the next section.
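Leave-one-subject-out cross-validation can be sketched as below. The generator name is hypothetical, and the subject list assumes the 8 subjects (IDs a to h) with 180 recordings each from the dataset described in Section 2.2.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every fold."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# One subject ID per recording: 8 subjects with 180 recordings each.
subject_ids = [subject for subject in "abcdefgh" for _ in range(180)]

folds = list(leave_one_subject_out(subject_ids))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Each fold trains on seven subjects and tests on the remaining one, so the reported scores reflect performance on people the model has not seen.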

In the first experiment, the basic DL modules used for comparison included the gated recurrent unit (GRU), LSTM, and CNN modules with different kernel sizes. For RNNs, bidirectional modules were also employed for comparison, and kernel sizes of 5, 10, 20, and 40 steps were used for CNN. All basic DL modules were stacked three times, including batch normalization and the ReLU activation function; finally, an output having the same feature size as the input data was obtained through a fully connected layer. In addition, several models presented in [25,26,27,28,29] that reported high accuracy by applying the CEF method to ECG data were adopted for comparison. As mentioned in Section 2, the loss is calculated by comparing the output of each model, which has the same length as the input, with the labels along the time axis. Details of the results for each activity are presented in Table 3. Additionally, the computational cost of all models (the number of parameters and the size of the model) is described in Appendix A.

Table 3.

Comparison with other DL models for evaluating performance improvement.

Background Laying Picking Get-Up
Precision Recall F1-Score Precision Recall F1-Score Precision Recall F1-Score Precision Recall F1-Score
GRU 0.960 0.964 0.962 0.815 0.864 0.839 0.808 0.838 0.822 0.880 0.811 0.844
LSTM 0.958 0.967 0.962 0.850 0.849 0.850 0.824 0.818 0.821 0.890 0.804 0.845
RNN 0.950 0.964 0.957 0.817 0.778 0.797 0.770 0.731 0.750 0.781 0.776 0.778
Bi RNN 0.966 0.974 0.970 0.878 0.885 0.881 0.886 0.875 0.881 0.884 0.832 0.857
Bi LSTM 0.973 0.971 0.972 0.874 0.904 0.889 0.882 0.860 0.871 0.870 0.872 0.871
Bi GRU 0.971 0.974 0.972 0.890 0.901 0.895 0.895 0.885 0.890 0.894 0.872 0.883
CNN 5 0.960 0.975 0.968 0.800 0.769 0.784 0.779 0.754 0.766 0.812 0.750 0.780
CNN 10 0.967 0.979 0.973 0.861 0.849 0.855 0.856 0.867 0.861 0.882 0.834 0.857
CNN 20 0.966 0.982 0.974 0.909 0.858 0.882 0.898 0.890 0.894 0.896 0.883 0.889
CNN 40 0.966 0.981 0.973 0.907 0.893 0.900 0.908 0.861 0.884 0.907 0.894 0.900
[25] 0.962 0.973 0.968 0.882 0.807 0.843 0.815 0.854 0.834 0.881 0.805 0.841
[26] 0.957 0.967 0.962 0.828 0.872 0.85 0.792 0.713 0.75 0.837 0.848 0.843
[27] 0.939 0.803 0.866 0.582 0.561 0.571 0.403 0.502 0.447 0.594 0.66 0.625
[28] 0.917 0.974 0.944 0.759 0.273 0.401 0.695 0.698 0.696 0.677 0.605 0.639
[29] 0.931 0.99 0.959 0.945 0.769 0.848 0.936 0.779 0.851 0.948 0.752 0.839
Proposed 0.979 0.977 0.978 0.918 0.927 0.922 0.913 0.930 0.922 0.909 0.930 0.920
Stand-Up Sitting Walking
Precision Recall F1-Score Precision Recall F1-Score Precision Recall F1-Score
GRU 0.871 0.763 0.813 0.873 0.855 0.864 0.884 0.912 0.898
LSTM 0.843 0.782 0.811 0.841 0.836 0.838 0.875 0.904 0.889
RNN 0.806 0.690 0.744 0.816 0.767 0.791 0.835 0.885 0.860
Bi RNN 0.891 0.845 0.867 0.876 0.866 0.871 0.928 0.935 0.931
Bi LSTM 0.860 0.871 0.866 0.900 0.892 0.896 0.926 0.934 0.930
Bi GRU 0.892 0.871 0.882 0.901 0.885 0.893 0.936 0.951 0.943
CNN 5 0.804 0.739 0.770 0.812 0.765 0.788 0.880 0.913 0.896
CNN 10 0.876 0.847 0.861 0.885 0.851 0.868 0.943 0.927 0.935
CNN 20 0.905 0.867 0.886 0.914 0.872 0.893 0.950 0.929 0.939
CNN 40 0.906 0.863 0.884 0.912 0.879 0.895 0.950 0.928 0.939
[25] 0.858 0.839 0.848 0.877 0.853 0.865 0.914 0.925 0.92
[26] 0.786 0.771 0.779 0.759 0.795 0.776 0.892 0.832 0.861
[27] 0.432 0.368 0.397 0.465 0.356 0.403 0.276 0.559 0.369
[28] 0.715 0.416 0.526 0.693 0.444 0.541 0.652 0.886 0.752
[29] 0.91 0.801 0.852 0.94 0.743 0.83 0.902 0.936 0.918
Proposed 0.906 0.896 0.901 0.915 0.905 0.910 0.950 0.953 0.952

The proposed model reported the highest F1-score in all classes, with every value at 0.9 or higher. In detail, for the background, the proposed model reported the best precision and F1-score values of 0.979 and 0.978, respectively. However, for the recall value, [29] was the best, with 0.990. Considering laying, picking, get-up, stand-up, and sitting, for precision, the results of [29] were the best, with 0.945, 0.936, 0.948, 0.91, and 0.94; meanwhile, for recall, the results of the proposed model were the best, with 0.927, 0.930, 0.930, 0.896, and 0.905, respectively. For walking, the CNNs with kernel sizes of 20 and 40 steps and the proposed model shared the highest precision, 0.950, while the recall of the proposed model was the highest (0.953). Overall, the F1-score of the proposed model was the best, with an average of 0.929, which was 0.019 higher than that of the CNN with a 40 step kernel size, the highest F1-score among the comparison models. The confusion matrices of the experiment are provided in Appendix A.
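The averages quoted above can be checked directly against Table 3. The following is a minimal sketch rather than the authors' evaluation code: it assumes that the F1-score is the harmonic mean of precision and recall and that the reported average is the unweighted mean over the seven classes. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean over classes (assumed averaging scheme)."""
    return sum(scores) / len(scores)


# Per-class F1-scores copied from Table 3, in the order
# background, laying, picking, get-up, stand-up, sitting, walking.
f1_proposed = [0.978, 0.922, 0.922, 0.920, 0.901, 0.910, 0.952]
f1_cnn40 = [0.973, 0.900, 0.884, 0.900, 0.884, 0.895, 0.939]

avg_proposed = macro_average(f1_proposed)
avg_cnn40 = macro_average(f1_cnn40)
gap = avg_proposed - avg_cnn40

# Background row of the proposed model: precision 0.979, recall 0.977.
background_f1 = f1_score(0.979, 0.977)

print(f"proposed {avg_proposed:.3f}, CNN 40 {avg_cnn40:.3f}, gap {gap:.3f}")
print(f"background F1 from precision and recall: {background_f1:.3f}")
```

Under these assumptions, the computed values agree with the averages and the background F1-score reported in the text.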

In the second experiment, the comparison models were the same as in the first experiment. However, to reproduce the SW method, each data sample of 250 frames was divided into several segments according to the window size and the size of the overlapping area. Each recognition model then classified every segment as a specific behavior. As a result, the output no longer has the same size as the input; instead, it is the class that the recognition model predicted for the corresponding segment. For overlapping areas, the class is determined by comparing the confidence of the surrounding predictions: the prediction with the higher confidence value, produced by the same recognition model at different window positions, is selected as the final decision. For an accurate comparison, the predicted class is expanded to the original size of the segment, and the loss is calculated through comparison with the label. For training the models, the aforementioned cross-entropy was adopted as the loss function. The fixed window size was set to 10, 20, and 40 time steps, with overlapping areas of 5, 10, and 20 steps, respectively. In addition, the evaluation criterion is the averaged F1-score of all activities. Details of the results are described in Table 4.
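The SW post-processing described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the window step is taken as the window size minus the overlap, the per-window predictions are toy values, and frames not covered by any window are left unlabeled because the text does not say how such a tail is handled. The function names are hypothetical.

```python
def sliding_windows(n_frames, window, overlap):
    """Return (start, end) index pairs for fixed-size windows.

    Assumption: consecutive windows are shifted by window - overlap frames.
    """
    step = window - overlap
    return [(s, s + window) for s in range(0, n_frames - window + 1, step)]


def expand_to_frames(n_frames, windows, predictions):
    """Expand per-window (label, confidence) pairs to per-frame labels.

    Where windows overlap, the prediction with the higher confidence wins.
    Frames covered by no window stay None.
    """
    labels = [None] * n_frames
    best_conf = [-1.0] * n_frames
    for (start, end), (label, conf) in zip(windows, predictions):
        for t in range(start, end):
            if conf > best_conf[t]:
                best_conf[t] = conf
                labels[t] = label
    return labels


# Toy example: 30 frames, window of 10 frames, overlap of 5 frames.
windows = sliding_windows(30, window=10, overlap=5)
toy_predictions = [
    ("background", 0.90),
    ("background", 0.60),
    ("walking", 0.70),
    ("walking", 0.95),
    ("walking", 0.80),
]
frame_labels = expand_to_frames(30, windows, toy_predictions)
print(frame_labels)

# Number of windows for a 250-frame sample with the 10-5 setting.
n_windows_10_5 = len(sliding_windows(250, window=10, overlap=5))
```

Once the per-frame labels are recovered in this way, they can be compared with the frame-level ground truth in the same manner as the CEF output.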

Table 4.

Results of comparison with the SW method.

Model GRU LSTM RNN Bi RNN Bi LSTM Bi GRU CNN 5 CNN 10 CNN 20 CNN 40 Proposed
SW 10-5 0.725 0.734 0.685 0.734 0.74 0.747 0.737 0.741 0.766 0.723 0.727
SW 20-10 0.631 0.657 0.65 0.753 0.725 0.717 0.672 0.701 0.781 0.718 0.71
SW 40-20 0.639 0.671 0.65 0.678 0.779 0.795 0.746 0.799 0.817 0.797 0.799
CEF 0.863 0.858 0.809 0.893 0.898 0.907 0.82 0.887 0.909 0.911 0.93

The proposed CEF method reported the best performance compared with the three types of SW. Averaged over all models, the F1-score of SW with a window size of 10 and an overlapping area of 5 was 0.732. For SW with a window size of 20 and an overlapping area of 10, the F1-score was 0.701, and for SW with a window size of 40 and an overlapping area of 20, it was 0.742. The proposed CEF method showed the best performance with an F1-score of 0.88, an improvement of 0.154 on average. In detail, relative to the mean of the three SW settings for the same model, CEF showed improvements of 0.198, 0.17, and 0.147 for GRU, LSTM, and RNN, respectively. For the bidirectional modules, the improvements were 0.171, 0.15, and 0.154 for Bi RNN, Bi LSTM, and Bi GRU, respectively. For CNN, the CEF method showed improvements of 0.101, 0.14, 0.121, and 0.165 for kernel sizes of 5, 10, 20, and 40, respectively. Finally, for the proposed model, the F1-score of CEF was higher than that of SW by as much as 0.184.
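The per-model improvements can be recomputed from Table 4. The sketch below assumes that each improvement is the CEF score minus the mean of the three SW scores of the same model, which is an interpretation of the text rather than a stated formula. The results agree with the reported figures up to a difference of about 0.001 in the third decimal place, which appears to come from rounding.

```python
# Averaged F1-scores copied from Table 4.
SW_KEYS = ("SW 10-5", "SW 20-10", "SW 40-20")
table4 = {
    "GRU":      dict(zip(SW_KEYS + ("CEF",), (0.725, 0.631, 0.639, 0.863))),
    "LSTM":     dict(zip(SW_KEYS + ("CEF",), (0.734, 0.657, 0.671, 0.858))),
    "RNN":      dict(zip(SW_KEYS + ("CEF",), (0.685, 0.650, 0.650, 0.809))),
    "Bi RNN":   dict(zip(SW_KEYS + ("CEF",), (0.734, 0.753, 0.678, 0.893))),
    "Bi LSTM":  dict(zip(SW_KEYS + ("CEF",), (0.740, 0.725, 0.779, 0.898))),
    "Bi GRU":   dict(zip(SW_KEYS + ("CEF",), (0.747, 0.717, 0.795, 0.907))),
    "CNN 5":    dict(zip(SW_KEYS + ("CEF",), (0.737, 0.672, 0.746, 0.820))),
    "CNN 10":   dict(zip(SW_KEYS + ("CEF",), (0.741, 0.701, 0.799, 0.887))),
    "CNN 20":   dict(zip(SW_KEYS + ("CEF",), (0.766, 0.781, 0.817, 0.909))),
    "CNN 40":   dict(zip(SW_KEYS + ("CEF",), (0.723, 0.718, 0.797, 0.911))),
    "Proposed": dict(zip(SW_KEYS + ("CEF",), (0.727, 0.710, 0.799, 0.930))),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Gain of CEF over the mean of the three SW settings, per model.
gain = {
    model: row["CEF"] - mean(row[k] for k in SW_KEYS)
    for model, row in table4.items()
}

# Mean F1-score of each method across all models.
method_mean = {
    key: mean(row[key] for row in table4.values())
    for key in SW_KEYS + ("CEF",)
}
overall_gain = method_mean["CEF"] - mean(method_mean[k] for k in SW_KEYS)

for model, value in gain.items():
    print(f"{model:9s} {value:.3f}")
print(f"overall   {overall_gain:.3f}")
```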

4. Discussion

In this study, the performance of a model designed by stacking various basic DL modules was evaluated through the first experiment mentioned in Section 3. For RNN modules, we confirmed that the average F1-score was 0.083 lower than that of the bidirectional RNN, which uses not only the previous but also the next information. RNN modules may be unsuitable for predicting behaviors with long execution times due to problems such as vanishing gradients. Further, because it is difficult to apply bidirectional RNN in real time, we judged that the CNN module is more suitable for implementing CEF. In addition, among the CNN modules, the one with the largest kernel size has the widest receptive field, and its performance is the best among the basic DL models. This means that using more than one piece of surrounding information to classify a particular frame has an advantage over RNNs. We also confirmed that the size of the receptive field used when classifying a specific frame greatly affects the performance. Moreover, depending on the execution time of the target action and the amplitude of the signal, features extracted from receptive fields of different sizes can have a positive effect on the prediction performance, as indicated by the fact that the proposed model using MCNN had the best F1-score.
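To make the link between kernel size and receptive field concrete, the sketch below computes the receptive field of a stack of 1D convolutions. It assumes three stacked layers (as stated for the basic DL modules) with stride 1, no dilation, and no pooling; these last three settings are assumptions, since the text does not specify them.

```python
def receptive_field(kernel_size, num_layers, stride=1, dilation=1):
    """Receptive field (in time steps) of stacked identical 1D convolutions."""
    field = 1
    jump = 1  # distance in input steps between adjacent outputs of a layer
    for _ in range(num_layers):
        field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return field


# Kernel sizes used for the CNN baselines, each stacked three times.
fields = {k: receptive_field(k, num_layers=3) for k in (5, 10, 20, 40)}
print(fields)
```

Under these assumptions, the receptive field grows linearly with the kernel size, so the 40-step kernel sees roughly nine times as many frames as the 5-step kernel when classifying a single frame.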

The model presented in [29], which reported the highest precision for the transition activities with short execution times (laying, get-up, stand-up, sitting, and picking), used both 1D-CNN and dilated 1D-CNN for feature extraction. This indicates that features from multiple receptive fields of various sizes had a positive effect on recognition performance. Therefore, when classifying data at a specific location on the time axis, it is essential to fuse meaningful features using surrounding data of various sizes. In the results of the CNN module with a single kernel size, the F1-score of transition activities with short execution times was 0.094 lower than that of activities that include repetitive patterns, such as background and walking. This is because information that interferes with predicting the class of a specific frame is included when the features extracted from the previous layer are passed to the next layer; performance can be improved by selectively filtering the necessary features. The models based on an autoencoder presented in [25,27] can compress and remove relatively insignificant features, but they lose positional information, resulting in a low F1-score of 0.7. In addition, the models in [28,29], which adopted a U-net architecture with a skip connection to preserve positional information, reported a higher F1-score of 0.757, but their performance was still low. Moreover, the model of [25], which emphasizes meaningful features by applying an attention layer that can give low weight to insignificant features, reported higher performance than the aforementioned two methods (0.874).

Consequently, the proposed model was designed by stacking MCNN that can reflect receptive fields of various sizes and an attention layer that emphasizes meaningful features. The attention scores of the channel level were derived differently for each activity (Figure 5). In other words, the features obtained from different sizes of the receptive field were emphasized selectively according to the properties of the target activities. In addition, the channel attention layer was applied differently according to the execution time of the target activity. When the execution time of an action was long, features extracted by kernels of all sizes were evenly emphasized; meanwhile, when the execution time was short, features extracted from a receptive field of a short size were emphasized. For spatial attention, the area of the same behavior as the data of a specific time step to be recognized was emphasized (Figure 6). This improved the classification performance of data at a specific location by filtering the data that are not related to the target and helped demarcate the boundary between the background and activity or distinguish between different adjacent activities. In summary, the proposed model was designed as a stacked structure by the fusion of the two methods, and it reported the highest performance.
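The behavior attributed to the spatial attention layer, in which frames belonging to the same activity receive higher weights, can be illustrated with a plain scaled dot-product self-attention over the time axis. This is a simplified sketch and not the proposed layer: it uses the input features directly as query, key, and value (no learned projections), and the toy feature values are invented for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(features):
    """Scaled dot-product self-attention along the time axis.

    features: list of T frames, each a list of C channel values.
    Returns (weights, output), where weights is a T x T attention map.
    Assumption: query = key = value = features (identity projections).
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    scores = [
        [sum(q * k for q, k in zip(query, key)) / scale for key in features]
        for query in features
    ]
    weights = [softmax(row) for row in scores]
    output = [
        [sum(w * frame[c] for w, frame in zip(row, features)) for c in range(dim)]
        for row in weights
    ]
    return weights, output


# Toy input: frames 0-1 resemble one activity, frames 2-3 another.
toy_features = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
weights, output = temporal_self_attention(toy_features)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy case, each frame assigns most of its attention to the frames with similar features, which mirrors the emphasis on the area of the same behavior described above.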

Figure 5.

Examples of attention scores at the channel level.

Figure 6.

Examples of attention scores at the spatial level.

In the second experiment, the proposed CEF was evaluated by comparison with the SW method. The performance of SW was lower than that of CEF for several reasons. First, if the window size is too small or too large, a recognition error occurs. When a window includes only a small amount of data, the data may be insufficient for classification. However, if a large amount of data is included in a single window, more than one activity may be involved, increasing the error. In other words, if a window contains more than one action, a misrecognition occurs, and it is impossible to specify the dividing point between the different actions. These problems can occur randomly according to the start and end points of the target behavior and of the recognition window. As shown in Figure 7, in the SW method implemented in this study, misrecognition frequently occurred in the areas corresponding to the division points, such as the start, end, and transition of the activities. Notably, the existing research treats the segment label as one class rather than labeling each frame. This can have a positive effect on the training of the recognition model, but it prevents a precise quantitative evaluation. Through additional tuning work, the performance of SW can be improved by obtaining the optimal window size and size of the overlapping area. However, performance improvements are not guaranteed for data with different behaviors or for other datasets, because SW fundamentally depends only on the data contained in the window. Therefore, CEF, which is not limited by changes in window size, could perform activity recognition more precisely. In addition, by classifying each frame rather than each window, the boundaries between various activities could be recognized at a finer resolution.

Figure 7.

Example of misrecognition by the SW method.

In this study, the CEF method was proposed to overcome the limitations of SW. However, several limitations still remain. First, as mentioned above, a CNN module with a different kernel size is required depending on the properties of the target activity. Therefore, the proposed model used features extracted from various receptive fields. Nevertheless, if more complex behaviors that are difficult to distinguish from other activities need to be recognized, a different kernel size may be more suitable. Thus, the number of CNN modules and the kernel sizes of the currently designed model have to be set experimentally. Second, for spatial attention, where the attention layer is applied to time-axis data, the attention map size for calculating the score increases as the input data size increases. This issue can cause limitations when applying the recognition model to embedded systems. Third, in this study, a quantitative comparison with previous studies was not performed. As mentioned in Section 2, the public datasets used in previous SW-based HAR research are of limited use here, and the evaluation criterion of SW differs from that of CEF; nevertheless, various quantitative comparisons with state-of-the-art studies are needed. Finally, if the wearing position of the sensor is changed, the recognition performance may decrease. Therefore, there is a need for a method that can respond to various structures of sensors for generalization.
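The second limitation can be quantified with a rough estimate. The sketch below assumes that the spatial attention map is a dense T × T matrix of 32-bit floats, as in standard dot-product self-attention; the actual memory layout of the proposed model is not given in the text, so the numbers are only indicative.

```python
def attention_map_bytes(num_frames, bytes_per_value=4, batch_size=1):
    """Memory of one dense T x T attention map (assumed float32 by default)."""
    return batch_size * num_frames * num_frames * bytes_per_value


# The 250-frame samples used in this study versus longer hypothetical inputs.
for frames in (250, 1000, 4000):
    kib = attention_map_bytes(frames) / 1024
    print(f"{frames:5d} frames -> {kib:10.1f} KiB per attention map")
```

Because the map grows with the square of the number of frames, a fourfold increase in input length leads to a sixteenfold increase in memory for this component alone.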

5. Conclusions

In this study, a CEF method, rather than the conventional SW method, was proposed for HAR. For implementation, features extracted from various receptive fields were used and fused using MCNN. Moreover, we could selectively weight the extracted features by proposing a layer that applies the attention mechanism to each channel and spatial level similar to CBAM. The channel level has the same structure as that of CBAM. Meanwhile, for the spatial level, a dot-product self-attention layer that does not lose positional information was adopted. Further, the proposed recognition model was evaluated using a newly constructed dataset. As a result, the proposed recognition model reported a higher F1-score than the models using basic DL modules. In addition, the proposed model outperformed several models applying CEF to ECG data. An experiment was also performed to verify the superiority of the proposed method over the existing SW method. It was found that the CEF method can perform HAR more precisely than the SW method with three different window sizes and overlapping areas. In addition, we confirmed that the proposed model is suitable for implementing CEF for HAR.

The performance of the proposed CEF method and recognition model was verified through experiments. However, several issues remain to be resolved, necessitating further studies for improvement. First, the proposed model will be advanced using a DL method that is more suitable for timeseries data, such as the temporal convolutional network (TCN) structure presented in [30]. We expect that the performance will increase because the TCN, which has shown strong performance on timeseries data, could better retain long-term information along the time axis. In addition, it is expected that memory usage, which increases according to the input data size in spatial attention, can be reduced. The second is the design of a canonical domain transformer layer to increase the generalization performance of the proposed model. It is possible to increase the generalization performance by simply increasing the diversity of the dataset using various augmentation methods, but this is not a fundamental solution and requires a lot of time. Therefore, a layer that transforms the input data or extracted features into a domain advantageous for recognition is required. A method such as the canonical domain transformer model suggested in [31,32] for obtaining a transformation matrix based on input data will be adopted to improve the generalization performance. Finally, the positional and historical context information will be used to improve recognition performance. The positional context information can be extracted using the distance between the target position and surrounding objects, such as a bed, chair, or desk. For instance, when the target is close to the bed, static activities related to the bed, such as laying, are more plausible than dynamic actions such as running. These rules correspond to the positional context information and will be used in the learning process of an adapted model. To facilitate the collection of positional context information, a UWB sensor that can be attached to various objects to provide the positional information of the targets in real time will be used. Meanwhile, historical context information can be extracted from actions that a subject has performed before. For example, after the subject performs the laying action, the subject cannot walk without first performing a stand-up or get-up action. Such constraints will enable training a recognition model to reduce the weight of activities inappropriate for the situation. Through these additional studies, if the generalization performance of the proposed CEF method can be improved, it will contribute to the research field of healthcare technology that requires precise recognition, such as a human monitoring system or elderly home care. Further, the recognized human’s daily activity will be a significant controlling factor for the proactive service of robots.
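As a purely hypothetical illustration of how historical context might reduce the weight of implausible activities, the sketch below down-weights class probabilities for transitions listed as implausible and then renormalizes. The rule set, the penalty factor, and the function name are all invented for this example; the text describes this only as planned future work and does not specify a mechanism.

```python
# Hypothetical rule: walking should not directly follow laying.
IMPLAUSIBLE_TRANSITIONS = {("laying", "walking")}


def apply_history_constraint(probabilities, previous_activity, penalty=0.1):
    """Down-weight classes that are implausible after previous_activity.

    probabilities: dict mapping activity name to predicted probability.
    penalty: multiplicative factor applied to implausible classes (assumed).
    """
    adjusted = {
        activity: p * penalty
        if (previous_activity, activity) in IMPLAUSIBLE_TRANSITIONS
        else p
        for activity, p in probabilities.items()
    }
    total = sum(adjusted.values())
    if total == 0:
        return dict(probabilities)  # nothing left to renormalize
    return {activity: p / total for activity, p in adjusted.items()}


raw = {"background": 0.1, "laying": 0.2, "get-up": 0.3, "walking": 0.4}
constrained = apply_history_constraint(raw, previous_activity="laying")
best = max(constrained, key=constrained.get)
print(constrained, best)
```

In this toy case, the most likely class changes from walking to get-up once the previous activity is taken into account. A learned alternative would encode such constraints in the training objective instead of in post-processing.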

Appendix A

Table A1.

Confusion matrix of comparison with other DL models for evaluating performance improvement.

GRU 226,833 839 1772 911 1395 1082 2427
1855 13,723 655 198 401 66 27
1578 305 14,750 27 233 163 9
2758 583 46 13,621 172 429 247
939 70 269 222 15,817 470 1093
1269 5 545 298 364 16,677 347
1123 66 55 359 1202 225 31,480
LSTM 227,391 718 1465 1125 1049 1268 2243
1821 13,614 687 257 395 151 0
1575 355 14,496 128 150 360 1
2547 533 22 13,958 44 497 255
1138 44 5 284 15,439 570 1400
1503 2 302 297 541 16,297 563
1340 33 86 513 1119 235 31,184
RNN 226,796 1127 1020 1378 1134 1115 2689
1756 13,133 1229 261 124 354 68
1742 1363 13,273 182 98 407 0
3126 1150 183 12,324 498 392 183
1435 30 57 412 13,800 792 2354
1953 0 454 272 1136 14,966 724
2000 23 28 455 1133 318 30,553
Bi RNN 229,210 866 1252 789 891 1039 1212
1836 14,083 549 248 10 199 0
1209 356 15,110 7 122 261 0
1351 506 0 15,086 273 357 283
1022 5 24 269 16,529 235 796
1398 91 277 284 343 16,898 214
1166 15 6 249 496 311 32,267
Bi LSTM 228,426 1215 1461 1152 832 1004 1169
1028 14,765 487 147 431 67 0
785 450 15,430 17 286 97 0
1148 528 0 15,559 153 333 135
1132 0 22 227 16,233 230 1036
1253 0 210 315 116 17,391 220
996 18 49 671 359 193 32,224
Bi GRU 229,067 981 1089 984 917 944 1277
1135 14,765 515 164 273 73 0
981 353 15,375 11 150 195 0
1243 386 0 15,551 166 334 176
1143 0 0 216 16,707 207 607
1282 13 300 338 120 17,257 195
1049 10 0 163 332 141 32,815
CNN 5 229,481 722 881 771 705 824 1875
1331 12,697 1515 350 434 572 26
1540 1152 13,124 499 501 246 3
2198 278 159 13,191 973 650 407
1143 89 223 859 14,233 924 1409
1968 480 446 345 748 14,927 591
1405 213 57 399 682 247 31,507
CNN 10 230,265 863 851 804 686 738 1052
1269 14,112 603 237 413 240 51
1533 389 14,496 143 306 188 10
1319 329 40 15,132 459 423 154
1140 63 167 327 16,367 340 476
1360 163 581 262 328 16,603 208
1136 86 107 373 569 236 32,003
CNN 20 231,035 743 667 728 663 651 772
1339 14,937 381 156 62 47 3
1665 391 14,636 10 184 170 9
1310 244 8 15,490 201 412 191
1128 16 67 203 16,805 164 497
1390 85 316 323 165 17,012 214
1183 247 33 206 643 154 32,044
CNN 40 230,854 795 707 764 651 652 836
1305 15,130 354 95 5 36 0
1400 318 15,237 0 4 106 0
1532 232 8 15,412 170 362 140
1434 56 84 218 16,262 294 532
1314 86 321 307 139 17,152 186
1215 62 88 212 678 213 32,042
[25] 228,873 802 986 977 1403 775 1443
1860 13,624 477 332 281 293 58
2094 459 13,767 39 476 218 12
1518 424 0 14,983 265 456 210
934 26 28 340 16,130 354 1068
1493 100 348 545 188 16,638 193
1038 33 0 256 1042 230 31,911
[26] 227,453 1217 1178 1414 1079 1500 1418
1552 14,358 339 315 68 258 35
1482 244 14,886 66 56 311 20
1822 696 34 13,774 492 365 673
1566 178 361 873 13,454 1443 1005
1808 113 812 331 614 15,497 330
2061 350 359 750 1231 1039 28,720
[27] 188,898 4613 4455 989 4807 1133 30,364
1888 11,169 253 510 107 232 2766
2505 640 9579 171 261 1062 2847
1794 1587 174 6568 2646 557 4530
1417 34 68 1455 9473 1634 4799
1919 407 1257 861 2766 6942 5353
2693 356 676 4644 3468 3382 19,291
[28] 229,161 1114 411 904 1039 583 2047
2984 10,237 286 876 83 107 2352
3744 2132 4654 235 1731 2027 2542
5394 646 214 7420 741 630 2811
1436 193 159 331 13,169 374 3218
5113 163 301 432 1511 8662 3323
2198 639 105 184 681 113 30,590
[29] 232,803 249 208 543 505 212 739
3652 12,728 219 289 0 0 37
3520 308 13,131 1 0 64 41
2381 116 0 14,294 238 316 511
2056 19 110 115 14,714 233 1633
3695 0 217 343 194 14,494 562
1918 0 10 128 69 93 32,292
Proposed 229,838 945 948 884 993 757 956
700 15,752 277 204 0 0 0
620 451 15,821 22 0 148 0
819 177 0 15,924 204 405 240
712 0 0 96 17,566 154 352
958 0 196 354 161 17,651 185
1030 0 0 93 317 176 32,914
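The precision and recall values in Table 3 can be recovered from these matrices. The sketch below uses the block for the proposed model and assumes that rows are true classes and columns are predicted classes. This orientation is an inference: it is the one under which the first row and column reproduce the background recall and precision of Table 3. The class order of the remaining rows is not labeled in the table, so only the first and last classes are checked here.

```python
# Confusion matrix of the proposed model, copied from Table A1.
confusion = [
    [229838, 945, 948, 884, 993, 757, 956],
    [700, 15752, 277, 204, 0, 0, 0],
    [620, 451, 15821, 22, 0, 148, 0],
    [819, 177, 0, 15924, 204, 405, 240],
    [712, 0, 0, 96, 17566, 154, 352],
    [958, 0, 196, 354, 161, 17651, 185],
    [1030, 0, 0, 93, 317, 176, 32914],
]


def per_class_metrics(matrix):
    """Precision and recall per class (rows = true, columns = predicted)."""
    size = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    precision = [matrix[i][i] / col_sums[i] for i in range(size)]
    recall = [matrix[i][i] / row_sums[i] for i in range(size)]
    return precision, recall


precision, recall = per_class_metrics(confusion)
print(f"first class: precision {precision[0]:.3f}, recall {recall[0]:.3f}")
print(f"last class:  precision {precision[-1]:.3f}, recall {recall[-1]:.3f}")
```

With this orientation, the first class matches the background values and the last class matches the walking values reported for the proposed model in Table 3.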

Table A2.

Computational cost of each model.

Model GRU LSTM RNN Bi RNN Bi LSTM Bi GRU CNN 5 CNN 10
Size of model (MB) 0.198 0.263 0.07 0.191 0.732 0.552 0.139 0.272
Number of parameters 51607 68487 17847 49255 191239 143911 36031 709911
Model CNN 20 CNN 40 [25] [26] [27] [28] [29] Proposed
Size of model (MB) 0.538 1.07 0.835 0.889 0.224 0.162 0.964 13.063
Number of parameters 140671 280191 217863 232839 58663 41975 251279 3416583

Author Contributions

S.-h.L. designed the algorithm, performed the experimental work, and wrote the manuscript. D.-W.L. organized the experiment setup. Corresponding author: M.S.K. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the data used were laboratory data. In addition, the data do not contain any information about the subjects’ identities.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. In addition, written informed consent has been obtained from the patients to publish this paper.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This work was supported by GIST Research Project grant funded by the GIST in 2022.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1. Wang J., Chen Y., Hao S., Peng X., Hu L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019;119:3–11. doi: 10.1016/j.patrec.2018.02.010.
  • 2. Chen K., Zhang D., Yao L., Guo B., Yu Z., Liu Y. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Comput. Surv. 2021;54:1–40. doi: 10.1145/3447744.
  • 3. Sun Z., Ke Q., Rahmani H., Bennamoun M., Wang G., Liu J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022;45:3200–3225. doi: 10.1109/TPAMI.2022.3183112.
  • 4. Demrozi F., Pravadelli G., Bihorac A., Rashidi P. Human activity recognition using inertial, physiological and environmental sensors: A comprehensive survey. IEEE Access. 2020;8:210816–210836. doi: 10.1109/ACCESS.2020.3037715.
  • 5. Abdel-Salam R., Mostafa R., Hadhood M. Human activity recognition using wearable sensors: Review, challenges, evaluation benchmark; Proceedings of the International Workshop on Deep Learning for Human Activity Recognition; Kyoto, Japan. 8 January 2021; pp. 1–15.
  • 6. Uslu G., Baydere S. A Segmentation Scheme for Knowledge Discovery in Human Activity Spotting. IEEE Trans. Cybern. 2022;52:5668–5681. doi: 10.1109/TCYB.2021.3137753.
  • 7. Rueda F.M., Grzeszick R., Fink G.A., Feldhorst S., Hompel M.T. Convolutional neural networks for human activity recognition using body-worn sensors. Informatics. 2018;5:26. doi: 10.3390/informatics5020026.
  • 8. Chavarriaga R., Sagha H., Calatroni A., Digumarti S.T., Tröster G., Millán J.d.R., Roggen D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 2013;34:2033–2042. doi: 10.1016/j.patrec.2012.12.014.
  • 9. Reiss A., Stricker D. Introducing a new benchmarked dataset for activity monitoring; Proceedings of the 2012 16th International Symposium on Wearable Computers; Newcastle, UK. 18–22 June 2012; Piscataway, NJ, USA: IEEE; 2021. pp. 108–109.
  • 10. Grzeszick R., Lenk J.M., Rueda F.M., Fink G.A., Feldhorst S., Ten Hompel M. Deep neural network based human activity recognition for the order picking process; Proceedings of the 4th international Workshop on Sensor-Based Activity Recognition and Interaction; Rostock, Germany. 21–22 September 2017; pp. 1–6.
  • 11. Zeng M., Nguyen L.T., Yu B., Mengshoel O.J., Zhu J., Wu P., Zhang J. Convolutional neural networks for human activity recognition using mobile sensors; Proceedings of the 6th International Conference on Mobile Computing, Applications and Services; Austin, TX, USA. 6–7 November 2014; Piscataway, NJ, USA: IEEE; 2014. pp. 197–205.
  • 12. Stiefmeier T., Roggen D., Ogris G., Lukowicz P., Tröster G. Wearable activity tracking in car manufacturing. IEEE Pervasive Comput. 2008;7:42–50. doi: 10.1109/MPRV.2008.40.
  • 13. Lockhart J.W., Weiss G.M., Xue J.C., Gallagher S.T., Grosner A.B., Pulickal T.T. Design considerations for the WISDM smart phone-based sensor mining architecture; Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data; San Diego, CA, USA. 21 August 2011; pp. 25–33.
  • 14. Cho H., Yoon S.M. Divide and conquer-based 1D CNN human activity recognition using test data sharpening. Sensors. 2018;18:1055. doi: 10.3390/s18041055.
  • 15. Anguita D., Ghio A., Oneto L., Parra Perez X., Reyes Ortiz J.L. A public domain dataset for human activity recognition using smartphones; Proceedings of the 21th International European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning; Bruges, Belgium. 24–26 April 2013; pp. 437–442.
  • 16. Hammerla N.Y., Halloran S., Plötz T. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv. 2016. arXiv:1604.08880.
  • 17. Bachlin M., Roggen D., Troster G., Plotnik M., Inbar N., Meidan I., Herman T., Brozgol M., Shaviv E., Giladi N., et al. Potentials of Enhanced Context Awareness in Wearable Assistants for Parkinson’s Disease Patients with the Freezing of Gait Syndrome; Proceedings of the 2009 International Symposium on Wearable Computers; Linz, Austria. 4–7 September 2009; Piscataway, NJ, USA: IEEE; 2009. pp. 123–130.
  • 18. Pienaar S.W., Malekian R. Human activity recognition using LSTM-RNN deep neural network architecture; Proceedings of the 2019 IEEE 2nd Wireless Africa Conference (WAC); Pretoria, South Africa. 18–20 August 2019; Piscataway, NJ, USA: IEEE; 2019. pp. 1–5.
  • 19. Kwapisz J.R., Weiss G.M., Moore S.A. Activity recognition using cell phone accelerometers. ACM SigKDD Explor. Newsl. 2011;12:74–82. doi: 10.1145/1964897.1964918.
  • 20. Zhao Y., Yang R., Chevalier G., Xu X., Zhang Z. Deep residual bidir-LSTM for human activity recognition using wearable sensors. Math. Probl. Eng. 2018;2018:7316954. doi: 10.1155/2018/7316954.
  • 21. He K., Zhang X., Ren S., Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015;37:1904–1916. doi: 10.1109/TPAMI.2015.2389824.
  • 22. Woo S., Park J., Lee J.-Y., Kweon I.S. Cbam: Convolutional block attention module; Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany. 8–14 September 2018; pp. 3–19.
  • 23. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017;30.
  • 24. Twomey N., Diethe T., Fafoutis X., Elsts A., McConville R., Flach P., Craddock I. A comprehensive study of activity recognition using accelerometers. Informatics. 2018;5:27. doi: 10.3390/informatics5020027.
  • 25. Malali A., Hiriyannaiah S., Siddesh G., Srinivasa K., Sanjay N. Supervised ECG wave segmentation using convolutional LSTM. ICT Express. 2020;6:166–169. doi: 10.1016/j.icte.2020.04.004.
  • 26. Matias P., Folgado D., Gamboa H., Carreiro A. Time Series Segmentation Using Neural Networks with Cross-Domain Transfer Learning. Electronics. 2021;10:1805. doi: 10.3390/electronics10151805.
  • 27. Sereda I., Alekseev S., Koneva A., Kataev R., Osipov G. ECG segmentation by neural networks: Errors and correction; Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN); Budapest, Hungary. 14–19 July 2019; Piscataway, NJ, USA: IEEE; 2019. pp. 1–7.
  • 28. Moskalenko V., Zolotykh N., Osipov G. Deep learning for ECG segmentation; Proceedings of the International Conference on Neuroinformatics; Dolgoprudny, Russia. 7–11 October 2019; Berlin/Heidelberg, Germany: Springer; 2019. pp. 246–254.
  • 29. Liang X., Li L., Liu Y., Chen D., Wang X., Hu S., Wang J., Zhang H., Sun C., Liu C. ECG_SegNet: An ECG delineation model based on the encoder-decoder structure. Comput. Biol. Med. 2022;145:105445. doi: 10.1016/j.compbiomed.2022.105445.
  • 30. Bai S., Kolter J.Z., Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv. 2018. arXiv:1803.01271.
  • 31. Jaderberg M., Simonyan K., Zisserman A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015;28.
  • 32. Qi C.R., Su H., Mo K., Guibas L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 652–660.

Articles from Sensors (Basel, Switzerland) are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
