Scientific Reports. 2024 Oct 30;14:26058. doi: 10.1038/s41598-024-75782-7

Parallel multi-stage rectification networks for 3D skeleton-based motion prediction

Jianqi Zhong 1,2,#, Conghui Ye 1,2,#, Wenming Cao 1,2, Hao Wang 1,2
PMCID: PMC11522317  PMID: 39472613

Abstract

Recurrent Neural Networks (RNNs) are widely used in human motion prediction and have achieved promising performance, owing to their robust capacity for spatial-temporal sequence modeling. However, RNN-based methods suffer from error accumulation due to their step-by-step prediction mechanism. In this paper, we therefore propose a three-stage parallel prediction network that guides the outputs of its three sub-networks with different objectives. In particular, we fuse the high-dimensional information produced by the three sub-networks to generate the final output. We also design a fusion block based on a GRU and an attention mechanism to extract high-dimensional information more efficiently. Extensive experiments show that our approach outperforms most recent methods in both short-term and long-term motion prediction on Human3.6M, CMU-MoCap, and 3DPW.

Keywords: Recurrent neural network, Error accumulation, Motion prediction

Subject terms: Engineering, Electrical and electronic engineering, Mechanical engineering

Introduction

3D human motion prediction forecasts a person's future movements from their historical motion. It allows downstream systems to anticipate changes in motion in advance and has important applications in automatic motion generation1–3, autonomous driving4, and human-computer interaction5–7.

Because human motion sequences progress in chronological order, Recurrent Neural Networks (RNNs)8–11, recognized for their effectiveness in sequence prediction tasks, are a classic choice for this problem. However, RNN-based techniques tend to converge toward static poses in the medium to long term and often suffer from abrupt transitions and error accumulation, partly because of the difficulty of training RNNs. Some efforts to tackle this problem apply Generative Adversarial Networks (GANs)12,13 to the human motion prediction task, but such models are often difficult to train. Recent studies show that the attention mechanism in the Transformer can effectively capture long-term relationships within time series, strengthening the model's sensitivity to dynamic actions and yielding predictions that are closely aligned with the temporal context. The idea of attention14 is thus used to train the model's sensitivity to dynamic motion and to generate predictions with high temporal correlation: an attention-based model can compare the action in the last observable frame with each pose in the history sequence to relate the current action to past actions. Although attention models have recently excelled at long-term prediction, they may not match RNNs at capturing short-term local features because of the global nature of their attention.

Meanwhile, Graph Convolutional Networks (GCNs) have been widely used in various fields, including human motion prediction. A large body of work shows that GCNs are well suited to the Human Motion Prediction (HMP) problem15–17. These methods regard each joint of the human skeleton as a node of a graph and build edges between joints; a GCN then captures the spatial relationships between joints, which benefits pose prediction. Among GCN-based deep networks, Ma et al.18 adopt a "series progression" structure that uses intermediate targets to pull the predicted values closer to the ground truth, achieving strong results. However, this recursive structure shares a weakness with RNNs: the output of the previous stage, produced from an intermediate target, is used as the input of the next stage. Because the previous stage's output is not necessarily correct, the next stage's input is not the true value, which typically leads to error accumulation.

To address error accumulation, we abandon the recursive network structure and propose a three-stage parallel rectification network: starting from the ground truth, we apply two different operations to generate two different action sequences as intermediate targets, and then use these two intermediate targets for collaborative prediction. Our method can thus correct the predicted values from both ends and avoid error accumulation. Additionally, so as not to miss any information, we also use the encoders' outputs, i.e., the information in the high-dimensional space (hereinafter referred to as high-dimensional information), for fusion and correction. For better fusion, we propose a fusion block that merges the high-dimensional information of the three encoders. The fusion block includes a short-term fusion module based on a Gated Recurrent Unit (GRU) and a long-term information diffusion module based on the attention mechanism. Because the GRU, as a special type of RNN, has a recursive structure, it inevitably accumulates errors, which degrades long-term prediction. To mitigate this, we pair the GRU with an attention mechanism: the attention module takes the short-term high-dimensional features captured by the GRU and fuses them into long-term high-dimensional features, capturing the long-term dependencies of the sequence. In this way, the fusion block retains the GRU's advantage in capturing short-term features while the attention mechanism reduces the RNN's error accumulation and improves the ability to capture long-term features.

In short, the main contributions of this paper are as follows:

  • We propose a three-stage parallel prediction network that takes two action sequences derived from the ground truth as intermediate targets, on the basis of which the predicted values approach the ground truth from both ends. The two intermediate-target networks work together to correct the predicted values for higher accuracy.

  • We propose a fusion block based on the collaboration of a GRU and an attention mechanism. The attention mechanism alleviates the error accumulation of the GRU and effectively captures the global features of action sequences.

  • Extensive experiments show that our method outperforms previous approaches by large margins on three public datasets.

Related work

RNN-based methods (recurrent neural networks)

Human motion prediction based on the 3D skeleton has attracted significant attention in recent years and is an important research topic of practical importance. Among human motion prediction approaches, those based on deep learning are at the forefront. Typical approaches cast human motion as a sequence-to-sequence (seq2seq) learning problem10,11,19; in particular, RNNs have been proposed to capture temporal information about human motion. To alleviate the exploding- and vanishing-gradient problems of RNNs, researchers proposed the 3-layer long short-term memory (LSTM-3LR) and encoder-recurrent-decoder (ERD) models10, in which LSTMs are used to extract long-term correlations. These models are action-specific, and there is usually a significant discontinuity between the last observed frame and the first predicted frame. Researchers have also proposed various RNN variants, such as the hierarchical motion recurrent model20 and Verso-Time Label Noise-RNN21. Still, using an RNN as a framework for sequential computation unavoidably gives rise to the accumulation of errors.

GCN-based methods (graph convolutional networks)

Graphs effectively represent large amounts of non-grid-structured data and explicitly depict the correlations between vertices22,23. GCNs, as an extension of CNNs, are suitable for data with specific graph structures, e.g., social networks24 and 3D skeleton data21,25. In recent years, many scholars have investigated graph neural networks (GNNs), extending deep networks to the graph domain; GNNs use a hierarchical architecture and end-to-end training. Lebailly et al.13 use a GCN as an encoder and another GCN to decode the aggregated features. The Graph Convolutional Network (GCN)26 simplifies ChebyNet and combines spectral analysis with spatial operations; feature aggregation on the graph of a GCN is designed directly from the vertex perspective, similar to convolution on an image. Researchers have also proposed GNN variants such as DMGNN (Dynamic Multiscale Graph Neural Networks)27, which divides the human skeleton into three scales and lets the multiscale graph change dynamically across feature levels; by learning an integrated multi-scale feature representation, DMGNN achieves more accurate future motion prediction. Ma et al.18 built a spatiotemporal GCN model that obtains more accurate long-term predictions by predicting intermediate poses of human motion. We also build our encoders and decoders with GCN as the basic module.

Attention-based methods

Compared with RNNs, which suffer from error accumulation, the attention mechanism proposed by Bahdanau et al.28 captures long-term dependencies between input and output sequences and, to a certain extent, reduces error accumulation. Xiong et al.29 proposed an attention-based GRU model that uses memory networks to let information flow between sentences. Aksan et al.30 proposed spatial and temporal self-attention blocks that extract spatiotemporal features from motion sequences and then aggregate the most informative components to generate predictions. Li et al.31 combined the discrete cosine transform (DCT) with the attention mechanism: the human motion sequence is first transformed into the frequency domain through the DCT, the attention mechanism then evaluates feature importance in the frequency domain, and the result is finally converted back to the time domain for output. In addition, Zhou et al.32 proposed SKTformer, which reduces computational complexity when using the attention mechanism for long-sequence modeling. Ibh et al.33 proposed a novel skeleton-based transformer model called TemPose, which employs multiple temporal and interaction layers to effectively capture human behavior while minimizing dependence on non-human visual context, offering significant versatility in its applications.

In this paper, we take into account both the error accumulation of RNNs and the relatively weak ability of the attention mechanism to capture local features, and propose a fusion module in which the GRU and the attention mechanism cooperate to integrate short-term and long-term information. Our model outperforms most models in both short-term and long-term prediction on the Human3.6M, CMU-MoCap, and 3DPW datasets.

Methods

We denote the observed historical motion sequence as $X_{1:T_h} = [x_1, x_2, \dots, x_{T_h}]$, consisting of $T_h$ consecutive human poses, where $x_i$ denotes the pose at frame $i$, and $X_{T_h+1:T_h+T_f}$ denotes the future pose sequence of length $T_f$. Our goal is to predict the poses $X_{T_h+1:T_h+T_f}$ for the future $T_f$ time steps. Following18, we repeat the last observed pose $x_{T_h}$ $T_f$ times and append the copies to $X_{1:T_h}$, which yields the padded input sequence $\tilde{X}_{1:L}$ of length $L = T_h + T_f$. Accordingly, our goal becomes finding a mapping $\mathcal{F}$ from the padded sequence to its ground truth $X_{1:L}$; our model aims to learn a better $\mathcal{F}$.
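To make the padding step concrete, the sketch below repeats the last observed pose $T_f$ times and appends the copies to the history; the tensor layout (frames, joints, 3D coordinates) and the function name are illustrative assumptions rather than the authors' released code.

```python
import torch

def pad_input(history: torch.Tensor, t_future: int) -> torch.Tensor:
    """Build the padded input sequence by repeating the last observed pose.

    history: (T_h, M, 3) tensor of observed poses (M joints, 3D coordinates).
    Returns a (T_h + T_f, M, 3) tensor whose last T_f frames all equal history[-1].
    """
    last_pose = history[-1:].expand(t_future, -1, -1)  # (T_f, M, 3), repeated last frame
    return torch.cat([history, last_pose], dim=0)      # (L, M, 3) with L = T_h + T_f

# Example: 10 observed frames, 25 future frames, 22 joints
padded = pad_input(torch.randn(10, 22, 3), t_future=25)
assert padded.shape == (35, 22, 3)
```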

Here we propose Parallel Multi-stage Rectification Networks (PMRNet), a three-stage framework for motion prediction; Fig. 1 illustrates its architecture. Given the padded input $\tilde{X}_{1:L}$, we first pass it through three GCN-based encoders, En1, En2, and En3, each producing a set of latent features, denoted $h_{en1}$, $h_{en2}$, and $h_{en3}$, respectively. The three encoders share the same structure: residual blocks composed of several GCLs and a $1\times1$ convolution. The decoders De1 and De2 then use $h_{en1}$ and $h_{en2}$ to generate two outputs, $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$. The three decoders also share the same structure, consisting of several GCLs and a $1\times1$ convolution arranged as residual blocks, with the final result produced by an ST-GCN. The specific structure of the encoder and decoder is described in "Encoder-copy-decoder stage" below.

Fig. 1.

Overview of our parallel multi-stage human motion prediction framework, which contains three stages. Note that, from top to bottom, the figure shows Stage-1, Stage-3, and Stage-2, each containing the correspondingly numbered encoder and decoder. Each stage takes the padded sequence derived from the ground-truth sequence $X$ as input. Stage-3 is guided by the ground truth, while the remaining two stages are guided by sequences obtained from different operations on the ground truth. The encoder-decoder prediction networks used in the three stages are the same; the difference is that Stage-3 has one more high-dimensional information fusion block than Stage-1 and Stage-2. Please refer to the main text for more details.

To effectively integrate the information in $h_{en1}$, $h_{en2}$, and $h_{en3}$, we introduce a fusion block. It takes the three latent feature sets as input and outputs $h_{fuse}$, which encapsulates their combined information. The details of the fusion block are elaborated in "Fusion block based on GRU (gated recurrent unit) and attention mechanism". Finally, $h_{fuse}$ is fed into De3 to generate the output $\hat{X}^{(3)}$. The generation of $\hat{X}^{(3)}$ is directly guided by the ground truth, while the generation of $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$ is guided by sequences derived from the ground truth, as detailed in "Parallel multi-stage rectification networks".
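To make the data flow of Fig. 1 concrete, a schematic forward pass could be organized as below. The encoder, decoder, and fusion internals are described in the following subsections; the class and attribute names here are our own, and the snippet shows structure only, not the authors' implementation.

```python
import torch.nn as nn

class PMRNet(nn.Module):
    """Schematic three-stage forward pass (structure only)."""
    def __init__(self, encoders, decoders, fusion):
        super().__init__()
        self.en1, self.en2, self.en3 = encoders   # three GCN-based encoders (same structure)
        self.de1, self.de2, self.de3 = decoders   # three GCN-based decoders (same structure)
        self.fusion = fusion                      # GRU + attention fusion block (Stage-3 only)

    def forward(self, x_padded):
        h1, h2, h3 = self.en1(x_padded), self.en2(x_padded), self.en3(x_padded)
        out1 = self.de1(h1)                 # Stage-1 output, guided by one intermediate target
        out2 = self.de2(h2)                 # Stage-2 output, guided by the other intermediate target
        h_fused = self.fusion(h1, h2, h3)   # fuse high-dimensional information from all encoders
        out3 = self.de3(h_fused)            # Stage-3 output, guided by the ground truth
        return out1, out2, out3
```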

Parallel multi-stage rectification networks

To achieve the above objectives, we design the parallel multi-stage rectification prediction framework shown in Fig. 1. It consists of three stages; from top to bottom, Fig. 1 shows Stage-1, Stage-3, and Stage-2 (denoted $S_1$, $S_3$, and $S_2$, respectively). The input of all three stages is the padded input $\tilde{X}$, and the stages perform the following tasks respectively:

[Equation (1)]

in which the fusion operation stands for extracting the high-dimensional information of Stage-1 and Stage-2. We introduce this operation below.

Following18, we perform two different operations on the ground truth $X$ to obtain two intermediate target sequences, and use them as the targets of the corresponding stage networks $S_1$ and $S_2$ to guide the generation of $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$, respectively. To guide the generation of $\hat{X}^{(3)}$, we use the ground truth itself. The two operations, Accumulated Average Smoothing (AAS) and the inverse operation of AAS (I-AAS), are introduced in the following (see Fig. 2).

Fig. 2.

AAS and I-AAS explained from a physical point of view. The gray skeleton is the ground truth, the red skeleton is the result of applying the I-AAS operation to the gray skeleton, and the green skeleton is the result of applying the AAS operation to it. On the right side of the figure, the arm is taken as an example: the I-AAS operation is equivalent to increasing the motion amplitude of the arm, while the AAS operation is equivalent to reducing it.

Suppose an action sequence $X$ has $M$ nodes in $D$-dimensional space, and each trajectory $p_j$ is composed of all coordinates of the same node across the action sequence; the action sequence thus contains one such trajectory per node. Since every trajectory is processed in the same way, the subscript $j$ is omitted in the following formulas.

As stated above, each trajectory contains two parts: the historical part $p_{1:T_h}$ and the future part $p_{T_h+1:L}$. We only process the future part and keep the historical part unchanged. The AAS algorithm is defined as

[Equation (2)]

In our understanding, on a physical level, the AAS operation is equivalent to reducing the motion range of the pose. Conversely, the inverse operation of AAS (I-AAS) increases the motion amplitude of the pose. The I-AAS algorithm is defined as:

[Equation (3)]

We apply the I-AAS and AAS operations to the ground truth $X$ to obtain the two intermediate targets, so that the different stages can capture action characteristics at different motion amplitudes.
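Because the exact forms of Eqs. (2) and (3) are not reproduced above, the sketch below shows one plausible realization that matches the verbal description: AAS replaces each future frame of a trajectory with the running (accumulated) average of the future frames seen so far, which damps the motion amplitude, and I-AAS is the corresponding inverse recurrence, which enlarges it. The function names, and the assumption that AAS is exactly this running average, are ours.

```python
import numpy as np

def aas(traj: np.ndarray, t_h: int) -> np.ndarray:
    """Accumulated average smoothing of one trajectory (a plausible form of Eq. (2)).

    traj: (L,) coordinate trajectory of a single joint axis.
    t_h:  number of historical frames, which are kept unchanged.
    Each future frame is replaced by the running mean of the future frames seen so far,
    which damps the motion amplitude of the future part.
    """
    out = traj.astype(float).copy()
    running_sum = 0.0
    for i, t in enumerate(range(t_h, len(traj)), start=1):
        running_sum += traj[t]
        out[t] = running_sum / i
    return out

def i_aas(traj: np.ndarray, t_h: int) -> np.ndarray:
    """Inverse of the accumulated average (a plausible form of Eq. (3)).

    Applying the inverse recurrence to the raw trajectory amplifies frame-to-frame
    changes in the future part, enlarging the motion amplitude.
    Sanity check: aas(i_aas(x, t_h), t_h) recovers x on the future frames.
    """
    out = traj.astype(float).copy()
    for i, t in enumerate(range(t_h, len(traj)), start=1):
        out[t] = i * traj[t] - (i - 1) * traj[t - 1]   # the i = 1 term reduces to traj[t]
    return out
```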

Encoder-copy-decoder stage

In this section, we introduce the network that fulfills the prediction task at each stage, an overview of which is illustrated in the upper middle of Fig. 1. Each stage is composed of an ST-GCN-based encoder and an ST-GCN-based decoder, with Stage-3 having an additional fusion block for integrating characteristic information of different motion amplitudes. In the following, we introduce the components one by one.

For pose prediction tasks, according to [20,21], GCNs can effectively extract the spatial features of a pose, and according to [22], GCNs can also extract its temporal features along the time dimension. In this paper, we use a spatio-temporal combination GCN (ST-GCN) to directly extract the spatio-temporal characteristics of each joint in the pose. An ST-GCN is used in the GCL of each encoder and decoder.

ST-GCN

Let $X \in \mathbb{R}^{L \times M \times F}$ be a pose sequence, where $L$ is the length of the sequence, $M$ is the number of joints of a pose, and $F$ indicates the number of features of a joint. We define two learnable adjacency matrices: a spatial adjacency $A^{S}$, which measures the relationships between the joint pairs of a pose, and a temporal adjacency $A^{T}$, which extracts the information of the joint trajectories. ST-GCN computes:

[Equation (4)]

where $W$ indicates the learnable parameters of ST-GCN, the transposition exchanges the first and second dimensions of its input, and the resulting tensor is the output of ST-GCN.
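Since Eq. (4) is not reproduced above, the following is a minimal PyTorch sketch of a layer with the described ingredients: a learnable $L \times L$ temporal adjacency, a learnable $M \times M$ spatial adjacency, and learnable feature weights. The module name, initialization, and einsum arrangement are our own assumptions.

```python
import torch
import torch.nn as nn

class STGCN(nn.Module):
    """A minimal spatio-temporal graph convolution layer (a sketch of Eq. (4)).

    Input/output shape: (batch, L, M, F_in) -> (batch, L, M, F_out), where L is the
    sequence length, M the number of joints, and F the per-joint feature size.
    """
    def __init__(self, length: int, joints: int, f_in: int, f_out: int):
        super().__init__()
        self.a_temporal = nn.Parameter(torch.eye(length))     # learnable L x L adjacency A^T
        self.a_spatial = nn.Parameter(torch.eye(joints))      # learnable M x M adjacency A^S
        self.weight = nn.Parameter(torch.empty(f_in, f_out))  # learnable feature map W
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Temporal aggregation: mix information across frames with A^T.
        x = torch.einsum("lt,btmf->blmf", self.a_temporal, x)
        # Spatial aggregation: mix information across joints with A^S.
        x = torch.einsum("mn,blnf->blmf", self.a_spatial, x)
        # Feature transform with the learnable weights W.
        return torch.matmul(x, self.weight)

# Example: padded length 35, 22 joints, 3 input coordinates, 64 output features
layer = STGCN(length=35, joints=22, f_in=3, f_out=64)
out = layer(torch.randn(8, 35, 22, 3))   # -> (8, 35, 22, 64)
```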

GCL

As shown at the top right of Fig. 1, a GCL passes its input sequentially through an ST-GCN, batch normalization, tanh, and dropout. The GCL is used to extract the global spatio-temporal features of the pose sequence.

Encoder

As shown at the top of Fig. 1, the encoder is a residual block composed of a $1\times1$ convolution layer and multiple GCLs. The first GCL maps the input $X$ from the pose space $\mathbb{R}^{L \times M \times 3}$ to the feature space $\mathbb{R}^{L \times M \times F}$, where $F$ is a fixed hyperparameter in this paper. In addition, we use a $1\times1$ convolution layer with $F$ kernels to map the input $X$ to the same feature space and add the result to the output of the last GCL as the residual connection of the encoder.

Copy

The feature map of size $L \times M \times F$ produced by the encoder is copied along the time dimension to obtain a feature map of size $2L \times M \times F$, which is used as the input of the decoder; the encoder features are also passed to the fusion block. According to18, the "copy" operation increases the size of the feature space and improves the predictive performance of the model.

Decoder

The decoder is a residual block consisting of multiple GCLs, an ST-GCN, and a $1\times1$ convolution layer. The GCLs extract temporal and spatial information from the feature space of $\mathbb{R}^{2L \times M \times F}$; since the "copy" operation doubles the time dimension of the decoder's hidden layer, the temporal adjacency matrix $A^{T}$ in the decoder's GCLs has size $2L \times 2L$. Finally, the high-dimensional features are projected back into the pose space through the final ST-GCN. At the same time, we apply a $1\times1$ convolution with 3 kernels to the decoder's input to obtain its residual connection to the decoder's output. The decoder outputs a sequence of length $2L$, and we only keep the first $L$ poses as the final result.
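Putting the pieces together, the sketch below wires up a GCL (ST-GCN, batch normalization, tanh, dropout), the residual encoder, the "copy" operation, and the decoder whose output is truncated to the first $L$ poses. Layer counts follow the implementation details reported later (2 GCLs per encoder, 4 per decoder); the compact ST-GCN stand-in, the feature width default of 64, the dropout rate, and the use of per-feature linear layers in place of $1\times1$ convolutions are our own assumptions.

```python
import torch
import torch.nn as nn

class STGCN(nn.Module):
    """Compact stand-in for the spatio-temporal graph convolution sketched earlier."""
    def __init__(self, length, joints, f_in, f_out):
        super().__init__()
        self.a_t = nn.Parameter(torch.eye(length))
        self.a_s = nn.Parameter(torch.eye(joints))
        self.w = nn.Parameter(torch.randn(f_in, f_out) * 0.01)

    def forward(self, x):                                   # x: (B, L, M, F_in)
        x = torch.einsum("lt,btmf->blmf", self.a_t, x)      # temporal aggregation
        x = torch.einsum("mn,blnf->blmf", self.a_s, x)      # spatial aggregation
        return torch.matmul(x, self.w)

class GCL(nn.Module):
    """ST-GCN -> batch norm -> tanh -> dropout, as described in the GCL paragraph."""
    def __init__(self, length, joints, f_in, f_out, p_drop=0.1):
        super().__init__()
        self.gcn = STGCN(length, joints, f_in, f_out)
        self.bn = nn.BatchNorm2d(f_out)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        x = self.gcn(x)                                     # (B, L, M, F_out)
        x = self.bn(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return self.drop(torch.tanh(x))

class EncoderCopyDecoder(nn.Module):
    """One stage: residual GCL encoder, copy along time, residual GCL decoder."""
    def __init__(self, length, joints, feat=64):
        super().__init__()
        self.enc = nn.Sequential(GCL(length, joints, 3, feat), GCL(length, joints, feat, feat))
        self.enc_res = nn.Linear(3, feat)                   # 1x1-conv-style residual projection
        self.dec = nn.Sequential(*[GCL(2 * length, joints, feat, feat) for _ in range(4)])
        self.dec_out = STGCN(2 * length, joints, feat, 3)   # project back to the pose space
        self.dec_res = nn.Linear(feat, 3)                   # residual path with 3 "kernels"
        self.length = length

    def forward(self, x):                                   # x: (B, L, M, 3), padded input
        h = self.enc(x) + self.enc_res(x)                   # encoder with residual connection
        h2 = torch.cat([h, h], dim=1)                       # "copy": double the time dimension
        y = self.dec_out(self.dec(h2)) + self.dec_res(h2)   # decoder with residual connection
        return h, y[:, : self.length]                       # keep only the first L poses
```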

Fusion block based on GRU (gated recurrent unit) and attention mechanism

To fuse high-dimensional information across the stages, we propose a fusion block based on a GRU and an attention mechanism. Its function is to fuse and refine the high-dimensional features extracted by the three encoders, and its output is used as the input of decoder-3 after the copy operation. For convenience, the outputs of encoder-1, encoder-2, and encoder-3 are denoted $h_{en1}$, $h_{en2}$, and $h_{en3}$, respectively; they all have size $L \times M \times F$.

As shown in Fig. 3, the fusion of the three sets of high-dimensional information is divided into two steps. The first step extracts the first $N$ frames $h_{en3}^{1:N}$ of $h_{en3}$ and inputs them into the GRU together with $h_{en1}$ and $h_{en2}$ to obtain the output $h_{gru}$; $h_{gru}$ then overwrites the first $N$ frames of $h_{en3}$. The second step inputs the high-dimensional information integrated in the first step into the attention-based diffusion block to obtain new high-dimensional information, and the final output contains the high-dimensional information of encoders 1-3. Next, we introduce these two steps in detail.

Fig. 3.

Overview of the fusion block for aggregating the high-dimensional information of the three stages. The GRU module generates the short-term high-dimensional information, and the attention module generates the remaining high-dimensional information in one pass to reduce the error accumulation introduced by the GRU module. Finally, the outputs of the two modules are spliced along the time dimension and fed to the decoder of Stage-3.

Short-term information fusion based on GRU

Before the high-dimensional information is input into the GRU, $h_{en1}$ and $h_{en2}$ are first integrated through a linear layer and a pooling layer:

[Equation (5)]

where $W_1$ and $W_2$ are trainable linear mappings and $\mathrm{AvgPool}(\cdot)$ is a temporal average pooling layer; the mapped $h_{en1}$ and $h_{en2}$ are spliced along the spatial dimension, and the result of Eq. (5) is denoted $h_{12}$.

GCN-based GRU (G-GRU). Details of the module are shown in the GRU part at the top left of Fig. 3. The function of the G-GRU is to learn and update hidden states under the guidance of the high-dimensional information in $h_{12}$ and $h_{en3}^{1:N}$. At step $t$ ($t = 1, \dots, N$), the G-GRU takes two inputs, the hidden state from the previous step (the initial state when $t = 1$) and frame $t$ of $h_{en3}^{1:N}$, and works as

[Equation (6)]

where the gate weights are trainable linear mappings and ST-GCN is the module introduced in "ST-GCN" above.

Finally, we splice all the outputs of the G-GRU along the time dimension:

[Equation (7)]

where the spliced result is denoted $h_{gru}$ and $N$ is a hyperparameter. Each G-GRU cell applies an ST-GCN to the hidden states for information propagation and generates the high-dimensional information for the next frame.
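Since Eq. (6) is not reproduced above, the following is only a structural sketch of the short-term fusion step: the fused $h_{12}$ features initialize the hidden state, each of the first $N$ frames of $h_{en3}$ is fed to a GRU cell in turn, and a learnable joint adjacency mixes the hidden state between steps as a stand-in for the ST-GCN propagation. The shapes, the flattening of joints and features, and the initialization choice are our assumptions.

```python
import torch
import torch.nn as nn

class GGRUFusion(nn.Module):
    """Sketch of the short-term fusion: a GRU cell whose hidden state is mixed by a
    learnable joint adjacency after every step (stand-in for the ST-GCN propagation)."""
    def __init__(self, joints: int, feat: int):
        super().__init__()
        self.cell = nn.GRUCell(input_size=joints * feat, hidden_size=joints * feat)
        self.adj = nn.Parameter(torch.eye(joints))   # learnable joint adjacency (GCN-style)
        self.joints, self.feat = joints, feat

    def forward(self, h12: torch.Tensor, h_en3_first_n: torch.Tensor) -> torch.Tensor:
        """h12: (B, M, F) features fused from encoders 1 and 2 (used as the initial state).
        h_en3_first_n: (B, N, M, F) first N frames of the encoder-3 features.
        Returns (B, N, M, F): one fused frame per step, spliced along time as in Eq. (7)."""
        b, n, m, f = h_en3_first_n.shape
        hidden = h12.reshape(b, m * f)
        outputs = []
        for t in range(n):
            frame = h_en3_first_n[:, t].reshape(b, m * f)
            hidden = self.cell(frame, hidden)                        # GRU update for frame t
            mixed = torch.einsum("ij,bjf->bif", self.adj, hidden.reshape(b, m, f))
            hidden = mixed.reshape(b, m * f)                         # graph-mixed hidden state
            outputs.append(mixed)
        return torch.stack(outputs, dim=1)
```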

Long-term information diffusion based on the attention mechanism

To avoid the error-accumulation characteristic of RNNs and to diffuse the high-dimensional information of the $N$ frames fused by the GRU to the remaining $T-N$ frames (where $T$ is the temporal length of the high-dimensional features), we propose a diffusion block based on the attention mechanism that generates the high-dimensional information of the last $T-N$ frames in a single pass. Finally, we splice the $N$ frames of high-dimensional information extracted by the GRU and the $T-N$ frames obtained by attention diffusion along the time dimension as the fused high-dimensional information.

As shown in the Attention part on the right of Fig. 3, queries and keys are used to calculate attention scores, which then weight the corresponding values. To do this, we first map the query and key to a vector space of dimension $d$ through two functions, $f_q(\cdot)$ and $f_k(\cdot)$, modeled with neural networks. This can be expressed as:

[Equation (8)]

The input of $f_q$ is the output $h_{gru}$ of the GRU module, and the input of $f_k$ is the last $T-N$ frames of the encoder-3 output $h_{en3}$. Then, the attention scores of $q$ and $k$ are:

[Equation (9)]

Finally, we multiply the attention scores by the last $T-N$ frames of $h_{en3}$, passed through a fully connected layer, to obtain $V$:

[Equation (10)]

where $V$ is the output of the attention module. The final input to decoder-3 is the splice of $h_{gru}$ and $V$ along the time dimension.
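Equations (8)-(10) are likewise not reproduced above, so the sketch below shows one plausible arrangement of the diffusion step: each of the last $T-N$ frames gathers information from the $N$ GRU-fused frames through scaled dot-product attention in a single pass, with a residual connection to the original frame features. Note that this sketch pools the query/key mappings over joints and treats the tail frames as the queries, whereas the text above maps $f_q$ to $h_{gru}$ and $f_k$ to the tail frames; the exact role assignment, the softmax normalization, and the residual are our assumptions.

```python
import math
import torch
import torch.nn as nn

class AttentionDiffusion(nn.Module):
    """Sketch of the long-term diffusion: the last T-N frames gather information from
    the N GRU-fused frames in one pass (no recurrence, hence no stepwise error buildup)."""
    def __init__(self, feat: int, dim: int = 64):
        super().__init__()
        self.f_q = nn.Linear(feat, dim)    # query mapping
        self.f_k = nn.Linear(feat, dim)    # key mapping
        self.f_v = nn.Linear(feat, feat)   # value mapping (fully connected layer)
        self.dim = dim

    def forward(self, h_gru: torch.Tensor, h_en3_tail: torch.Tensor) -> torch.Tensor:
        """h_gru: (B, N, M, F) fused frames; h_en3_tail: (B, T-N, M, F) remaining frames.
        Returns (B, T-N, M, F): diffused features for the remaining frames."""
        q = self.f_q(h_en3_tail).mean(dim=2)                   # (B, T-N, dim), pooled over joints
        k = self.f_k(h_gru).mean(dim=2)                        # (B, N, dim)
        scores = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.dim), dim=-1)  # (B, T-N, N)
        v = self.f_v(h_gru)                                    # (B, N, M, F)
        out = torch.einsum("btn,bnmf->btmf", scores, v)        # weighted sum over the N frames
        return out + h_en3_tail                                # residual keeps frame-wise content
```

The final decoder-3 input would then be the temporal splice torch.cat([h_gru, out], dim=1), matching the description above.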

Experiments

Datasets

Human3.6M Human3.6M is a large public dataset for 3D human pose estimation research36. It contains 3.6 million 3D human poses and corresponding images, with 7 subjects (S1, S5, S6, S7, S8, S9, and S11) and 17 action scenarios, such as discussion, eating, movement, and greeting. Each pose has 32 joints in exponential-map format. Similar to18, we convert them to 3D coordinate representations and discard 10 redundant joints. Global rotations and translations of the poses are excluded. S5 and S11 are used for testing and validation, respectively, while the rest are used for training.

CMU-MoCap The CMU-MoCap dataset consists of 5 categories of general movements: "human interaction", "interaction with the environment", "movement", "physical activity movement", and "situations scenarios"36. There are 8 representative action categories commonly used in CMU-MoCap, including jumping, backing, running, sitting down, standing up, pickup, basketball, and cartwheels. Each pose contains 38 joints in exponential-map format, which are also translated into 3D coordinate representations; global rotations and translations of the poses are excluded. As per17,34, we keep 25 joints and discard the others. The division into training and test sets is also the same as in17,34.

3DPW 3DPW37 is a challenging dataset containing human movements captured in indoor and outdoor scenes. The poses in this dataset are represented in 3D space. Each pose contains 26 joints, 23 of which are used (the other 3 are redundant).

Comparison settings

Evaluation metrics

Mean Per Joint Position Error (MPJPE) in millimeters is the most widely used evaluation metric. Assume that the predicted pose sequence is $\hat{X}_{1:T}$ and the corresponding ground truth is $X_{1:T}$. Then the MPJPE loss is

$$L_{\mathrm{MPJPE}} = \frac{1}{M\,T}\sum_{t=1}^{T}\sum_{j=1}^{M}\left\lVert \hat{p}_{t,j} - p_{t,j} \right\rVert_{2} \qquad (11)$$

where $\hat{p}_{t,j}$ represents the predicted position of the $j$-th joint in frame $t$, and $p_{t,j}$ is the corresponding ground-truth position.
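For reference, MPJPE can be computed as below; the mean is taken over joints, frames, and batch, following the standard definition of the metric. The tensor layout is an assumption for illustration.

```python
import torch

def mpjpe(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error in the same units as the inputs (here: millimeters).

    pred, target: (..., T, M, 3) predicted and ground-truth joint positions.
    Returns the per-joint L2 distance averaged over joints, frames, and any batch dims.
    """
    return torch.linalg.norm(pred - target, dim=-1).mean()

# Example: batch of 8 sequences, 25 future frames, 22 joints
loss = mpjpe(torch.randn(8, 25, 22, 3), torch.randn(8, 25, 22, 3))
```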

Lengths of input and output sequences

Human3.6M and CMU-MoCap use an input length of 10 frames and an output length of 25 frames; 3DPW uses an input length of 10 and an output length of 30.

Implementation details

There are 3 stages in our entire network. Each encoder contains 2 GCLs and each decoder contains 4 GCLs, so the framework contains 18 GCLs in total. N in the fusion block is set to 15 for long-term prediction and 10 for short-term prediction. We employ Adam as the solver; the learning rate is initially 0.005 and is multiplied by 0.96 after each epoch. The model is trained for 50 epochs with a batch size of 32. The devices we used are an NVIDIA GeForce RTX 2060 GPU and an Intel(R) Core(TM) i7-10700 CPU.
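A minimal optimization setup matching the reported hyperparameters (Adam, initial learning rate 0.005 multiplied by 0.96 after each epoch, 50 epochs, batch size 32) might look as follows. `model` and `train_loader` are placeholders, and only the final-stage MPJPE term is shown; the full method additionally supervises the Stage-1 and Stage-2 outputs with the two intermediate targets.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

def train(model, train_loader, device="cuda", epochs=50):
    """Optimization setup as reported: Adam, lr 0.005, x0.96 decay per epoch."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
    scheduler = ExponentialLR(optimizer, gamma=0.96)
    for epoch in range(epochs):
        for padded_input, target in train_loader:           # batches of size 32
            padded_input, target = padded_input.to(device), target.to(device)
            pred = model(padded_input)                       # final Stage-3 prediction
            loss = torch.linalg.norm(pred - target, dim=-1).mean()   # MPJPE loss, Eq. (11)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                     # multiply lr by 0.96 after each epoch
```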

Comparisons with previous approaches

We compare our method with Res. Sup.11, DMGNN27, LTD34, MSR17, PGBIG18, and DS-GCN35 on the three datasets. Res. Sup. is an early RNN-based approach. DMGNN uses a GCN for feature extraction and the RNN variant GRU for decoding. LTD relies entirely on GCNs and performs prediction in the frequency domain. MSR is a recent method that applies the LTD idea in a multi-scale manner. PGBIG builds a GCN-based network with intermediate targets and is currently the most advanced method of this kind. DS-GCN is one of the latest GCN-based methods. All of these methods are previous state-of-the-art approaches with publicly released code.

Human3.6M. Table 1 shows quantitative comparisons of short-term prediction (up to 400 ms) on Human3.6M between our method and the approaches above, and Table 2 shows comparisons of long-term prediction (from 560 ms to 1000 ms). In most cases, our results are better than those of the compared methods. In Fig. 4a, we compare the approaches statistically: taking PGBIG as the baseline, we subtract the prediction errors of MSR, LTD, and our method from those of PGBIG and plot the relative average prediction errors at every future timestamp. As can be seen, MSR and LTD are considerably less effective than PGBIG, and there is little difference between them. Our method outperforms PGBIG at all prediction timestamps, with the most significant advantage at 160 ms.

Table 1.

Comparisons of short-term prediction on Human3.6M. Results at 80 ms, 160 ms, 320 ms, and 400 ms in the future are shown.

Scenarios Walking Eating Smoking Discussion
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup.11 29.4 50.8 76.0 81.5 16.8 30.6 56.9 68.7 23.0 42.6 70.1 82.7 32.9 61.2 90.9 96.2
DMGNN27 17.3 30.7 54.6 65.2 11.0 21.4 36.2 43.9 9.0 17.6 32.1 40.3 17.3 34.8 61.0 69.8
LTD34 12.3 23.0 39.8 46.1 8.4 16.9 33.2 40.7 7.9 16.2 31.9 38.9 12.5 27.4 58.5 71.7
MSR17 12.2 22.7 38.6 45.2 8.4 17.1 33.0 40.4 8.0 16.3 31.3 38.2 12.0 26.8 57.1 69.7
PGBIG18 Inline graphic Inline graphic Inline graphic Inline graphic 7.0 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 10.0 Inline graphic Inline graphic Inline graphic
DS-GCN35 11.0 22.3 38.8 45.1 Inline graphic 15.5 31.7 39.1 Inline graphic 14.7 29.7 36.6 Inline graphic 24.3 54.5 67.4
Ours 9.4 17.7 32.8 38.7 6.8 13.1 30.2 36.7 6.3 12.1 27.8 33.6 10.0 22.1 52.7 64.5
Scenarios Directions Greeting Phoning Posing
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 35.4 57.3 76.3 87.7 34.5 63.4 124.6 142.5 38.0 69.3 115.0 126.7 36.1 69.1 130.5 157.1
DMGNN 13.1 24.6 64.7 81.9 23.3 50.3 107.3 132.1 12.5 25.8 48.1 58.3 15.3 29.3 71.5 96.7
LTD 9.0 19.9 43.4 53.7 18.7 38.7 77.7 93.4 10.2 21.0 42.5 52.3 13.7 29.9 66.6 84.1
MSR 8.6 19.7 43.3 53.8 16.5 37.0 77.3 93.4 10.1 20.7 41.5 51.3 12.8 29.4 67.0 85.0
PGBIG 7.2 17.6 Inline graphic Inline graphic 15.2 34.1 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 10.7 25.7 Inline graphic Inline graphic
DS-GCN 6.8 Inline graphic Inline graphic 51.6 14.2 Inline graphic 72.1 87.3 8.5 19.2 40.3 49.8 Inline graphic Inline graphic 60.6 77.3
Ours Inline graphic 14.7 39.8 50.9 Inline graphic 31.2 69.7 85.7 8.2 17.4 38.5 47.6 9.1 23.6 58.7 74.8
Scenarios Purchases Sitting Sittingdown Takingphoto
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 36.3 60.3 86.5 95.9 42.6 81.4 134.7 151.8 47.3 86.0 145.8 168.9 26.1 47.6 81.4 94.7
DMGNN 21.4 38.7 75.7 92.7 11.9 25.1 44.6 50.2 15.0 32.9 77.1 93.0 13.6 29.0 46.0 58.8
LTD 15.6 32.8 65.7 79.3 10.6 21.9 46.3 57.9 16.1 31.1 61.5 75.5 9.9 20.9 45.0 56.6
MSR 14.8 32.4 66.1 79.6 10.5 22.0 46.3 57.8 16.1 31.6 62.5 76.8 9.9 21.0 44.6 56.3
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic 8.8 Inline graphic 42.4 53.8 Inline graphic Inline graphic 57.4 71.5 Inline graphic 18.9 Inline graphic Inline graphic
DS-GCN 12.6 29.6 62.2 75.9 Inline graphic 19.3 42.8 54.3 14.1 28.0 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 53.5
Ours 11.5 26.6 59.6 72.5 8.4 17.9 Inline graphic Inline graphic 13.0 25.5 56.2 69.8 8.3 17.8 41.5 52.5
Scenarios Waiting Walkingdog Walkingtogether Average
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 30.6 57.8 106.2 121.5 64.2 102.1 141.1 164.4 26.8 50.1 80.2 92.2 34.7 62.0 101.1 115.5
DMGNN 12.2 24.2 59.6 77.5 47.1 93.3 160.1 171.2 14.3 26.7 50.1 63.2 17.0 33.6 65.9 79.7
LTD 11.4 24.0 50.1 61.5 23.4 46.2 83.5 96.0 10.5 21.0 38.5 45.2 12.7 26.1 52.3 63.5
MSR 10.7 23.1 48.3 59.2 20.7 42.9 80.4 93.3 10.6 20.9 37.4 43.9 12.1 25.6 51.6 62.9
PGBIG 8.9 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 8.7 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
DS-GCN Inline graphic Inline graphic 44.2 55.2 19.6 41.8 77.6 90.2 9.0 19.7 36.3 42.6 Inline graphic 23.3 48.7 59.8
Ours 8.0 18.7 42.9 52.9 17.7 36.4 71.9 84.5 Inline graphic 17.5 34.0 40.4 9.8 20.8 46.6 57.2

The best results are highlighted in bold, and the second best is marked by underline.

Table 2.

Comparisons of long-term prediction on Human3.6M. Results at 560 ms and 1000 ms in the future are shown. The best results are highlighted in bold, and the second best is marked by underline.

Scenarios Walking Eating Smoking Discussion Directions Greeting Phoning Posing
Millisecond 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms
Res. Sup. 81.7 100.7 79.9 100.2 94.8 137.4 121.3 161.7 110.1 152.5 156.1 166.5 141.2 131.5 194.7 240.2
DMGNN 73.4 95.8 58.1 86.7 50.9 72.2 81.9 138.3 110.1 115.8 152.5 157.7 78.9 98.6 163.9 310.1
LTD 54.1 59.8 53.4 77.8 50.7 72.6 91.6 121.5 71.0 101.8 115.4 148.8 69.2 103.1 114.5 173.0
MSR 52.7 63.0 52.5 77.1 49.5 71.6 88.6 117.6 71.2 100.6 116.3 147.2 68.3 104.4 116.3 174.3
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 118.2 69.3 100.4 110.2 Inline graphic Inline graphic 102.7 Inline graphic 164.8
DS-GCN 52.7 59.7 51.8 76.1 48.0 71.1 87.0 116.3 Inline graphic Inline graphic 108.6 142.2 66.6 102.2 106.5 163.3
Ours 47.6 55.3 50.3 74.9 45.1 68.2 86.5 Inline graphic 68.7 98.9 Inline graphic 142.2 64.1 Inline graphic 105.1 Inline graphic
Scenarios Purchases Sitting Sitting down Taking photo Waiting Walking dog Walking together Average
Millisecond 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms
Res. Sup. 122.7 160.3 167.4 201.5 205.3 277.6 117.0 143.2 146.2 196.2 191.3 209.0 107.6 131.1 97.6 130.5
DMGNN 118.6 153.8 60.1 104.9 122.1 168.8 91.6 120.7 106.0 136.7 194.0 182.3 83.4 115.9 103.0 137.2
LTD 102.0 143.5 78.3 119.7 100.0 150.2 77.4 119.8 79.4 108.1 111.9 148.9 55.0 65.6 81.6 114.3
MSR 101.6 139.2 78.2 120.0 102.8 155.5 77.9 121.9 76.3 106.3 111.9 148.2 52.9 65.9 81.1 114.2
PGBIG Inline graphic Inline graphic 74.4 116.1 96.7 147.8 Inline graphic 118.6 Inline graphic Inline graphic Inline graphic Inline graphic 51.9 64.3 Inline graphic Inline graphic
DS-GCN 97.5 137.7 74.9 117.7 Inline graphic Inline graphic 74.5 Inline graphic 73.1 105.6 109.8 147.6 Inline graphic 61.2 77.8 111.0
Ours 94.5 131.2 Inline graphic Inline graphic 95.8 146.3 72.3 117.7 71.7 102.5 102.7 138.3 50.1 Inline graphic 75.8 109.0

Fig. 4.

(a) Advantage analysis (Human3.6M). The advantage of our method is most significant at 160 ms. (b) Advantage analysis (CMU-MoCap). The advantage of our method is most significant at 560 ms. (c) Advantage analysis (3DPW). The advantage of our method is most significant at 400 ms.

CMU-MoCap. We report the average MPJPE for short-term prediction (up to 400 ms) and long-term prediction (560 ms to 1000 ms) in Table 3. In Fig. 4b, we again take PGBIG as the baseline and subtract the prediction errors of MSR, LTD, and our method from those of PGBIG; the relative average prediction errors at every future timestamp are plotted. Our approach still outperforms PGBIG on this dataset, with the most significant advantage at 560 ms.

Table 3.

CMU-MoCap: comparisons of average prediction errors.

Millisecond 80 ms 160 ms 320 ms 400 ms 560 ms 1000 ms
Res. Sup. 24.0 43.0 74.5 87.2 105.5 136.3
DMGNN 13.6 24.1 47.0 58.8 77.4 112.6
LTD 9.3 17.1 33.0 40.9 55.8 86.2
MSR 8.1 15.2 30.6 38.6 53.7 83.0
PGBIG 7.6 14.3 29.0 36.6 53.7 80.1
DS-GCN Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Ours 7.2 13.5 27.4 35.2 49.7 79.7

3DPW. We report the average MPJPE for short-term prediction (up to 400 ms) and long-term prediction (600 ms to 1000 ms) in Table 4. In Fig. 4c, we take PGBIG as the baseline and subtract the prediction errors of MSR, LTD, and our method from those of PGBIG. Our approach works very well on this challenging dataset, especially for short-term prediction, where our advantage is significant: at 400 ms, the error of our method is 10.4 lower than that of PGBIG.

Table 4.

3DPW: comparisons of average prediction errors.

Millisecond 200 ms 400 ms 600 ms 800 ms 1000 ms
Res. Sup. 113.9 173.1 191.9 201.1 210.7
DMGNN 37.3 67.8 94.5 109.7 123.6
LTD 35.6 67.8 90.6 106.9 117.8
MSR 37.8 71.3 93.9 110.8 121.5
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Ours 22.9 47.9 75.9 92.8 102.4

Model complexity analysis

As shown in Table 5, our model is smaller than most mainstream methods; however, due to the added fusion block, it is larger than PGBIG. Since our network has three stages and requires two additional losses to be computed, it is slightly slower than LTD and PGBIG. If the fusion block uses only the attention mechanism, the running speed improves but the model size slightly increases; if it uses only the GRU, the running speed decreases but the model size also decreases.

Table 5.

Time cost and model size comparisons.

Method Train (Per batch) (ms) Test (Per batch) (ms) Model size (M)
DMGNN 473 85 46.90
LTD 114 30 2.55
Res. Sup. 191 57 6.30
PGBIG 145 43 1.74
TemPose 3.8
PMRNet-w/o Att 250 156 1.48
PMRNet-w/o GRU 117 41 2.19
PMRNet(w/GRU+ATT) 157 52 2.12

Ablation studies

We conducted ablation experiments on the Human3.6M dataset to further explore and analyze our method.

Architecture

Several design choices help to explore the effectiveness of our approach: (1) the multi-stage learning framework, (2) the choice of intermediate targets, and (3) the composition of the fusion block. Table 6 shows ablation experiments on different variants of the full model. The full model has three stages guided by different targets and a fusion block containing both the GRU and the attention mechanism; its average prediction error is 63.82. (1) To show the effectiveness of the three stages, we test the cases of only one stage and of only two stages. When only Stage-3 is used but with 18 GCLs (i.e., 6 GCLs for the encoder and 12 GCLs for the decoder), the average prediction error rises sharply to 67.99. When Stage-1 or Stage-2 is removed (denoted "w/o Stage-1" and "w/o Stage-2", respectively), i.e., Stage-3 and the remaining stage each have 3 GCLs for the encoder and 6 GCLs for the decoder, the average prediction errors are 73.09 and 73.85, respectively. Thus, the three stages have a significant effect on prediction quality. (2) In the next experiment, we use the ground truth (GT) to guide all the intermediate outputs, which yields an average prediction error of 65.38. (3) Finally, we remove the GRU (denoted "w/o GRU") and the attention mechanism (denoted "w/o ATT") from the fusion block, respectively; the resulting average prediction errors are 68.45 and 67.04. Figure 6 visualizes these architecture ablations. It can be seen from Fig. 6 that removing a single stage has the biggest impact on the model. When the attention mechanism is missing, the model's predictions become better at short horizons but worse at long horizons, indicating that the GRU plays a large role in short-term prediction but also suffers from error accumulation. In the absence of the GRU module, the predictions become worse at all horizons, indicating that the attention mechanism has a certain similarity with the GRU. When all stage outputs are guided directly by the GT, performance also deteriorates, indicating that generating intermediate targets is necessary.

Table 6.

Ablations on architecture. The best results are highlighted in bold, and the second best is marked by underline.

80 ms 160 ms 320 ms 400 ms 560 ms 720 ms 880 ms 1000 ms Average
Only stage-3 11.97 25.17 50.3 61.24 80.5 94.57 106.23 113.98 67.99
w/o Stage-2 15.92 31.85 65.21 72.49 83.24 96.46 108.73 116.97 73.85
w/o Stage-1 15.54 31.33 65.19 72.49 82.73 94.94 107.52 115.02 73.09
Guided by GT at all stages 10.88 22.23 48.74 59.97 76.43 91.04 102.52 111.27 65.38
w/o GRU in fusion block 14.93 24.78 54.38 64.65 80.51 93.53 103.59 111.27 68.45
w/o ATT in fusion block 9.43 20.17 45.47 58.12 79.92 96.22 109.57 117.43 67.04
PMRNet (w/GRU+ATT) Inline graphic Inline graphic Inline graphic 57.28 75.86 89.85 101.21 109.09 63.82
Fig. 6.

Visualization of ablation experiments on architecture. The full model is the baseline. We subtract the prediction errors of the full model from those of the compared models.

Qualitative visualization

To put a finer point on the evaluation of our method, we present qualitative results by visualizing the predicted motion sequences (see Fig. 5). We provide examples of the actions "basketball-signal" and "directing-traffic" from CMU-MoCap. We compare the model in which the GRU and the attention mechanism are fused cooperatively with the fusion-block variants without the GRU (w/o GRU) and without the attention mechanism (w/o Att), respectively. As can be seen from the frames highlighted in red boxes, using both the GRU and the attention mechanism allows the fusion block to generate more realistic body movements.

Fig. 5.

Visualization of predicted poses on two samples of CMU Mocap.

Size of N in the fusion block

The high-dimensional information from encoder-1 and encoder-2, after being fused by the linear and pooling layers, is fed into the GRU together with the first $N$ frames of the high-dimensional information from encoder-3. When $N$ frames are input to the GRU, the remaining $T-N$ frames are input to the subsequent attention diffusion layer, where $T$ is the temporal length of the high-dimensional information. For short-term prediction (10 frames input, 10 frames output, so the high-dimensional information has a time dimension of 20 frames), the range of $N$ is 0-20. When $N = 0$, only the attention mechanism is used without the GRU, and when $N$ equals the full time dimension, the opposite is true; these two cases correspond to the architecture ablations above. We therefore test three intermediate settings of $N$ for short-term prediction. Similarly, for long-term prediction (10 frames input, 25 frames output, so the time dimension of the high-dimensional information is 35 frames), we test four settings of $N$. The experimental results are given in Table 7 and visualized in Fig. 7, where the $N = 15$ setting is the baseline and its prediction errors are subtracted from those of the compared settings. Combined with the data in Table 7, Fig. 7a shows that for short-term prediction one setting gives the highest average error of 35.13, while the other two settings give 33.84 and 33.65, with little difference between them. For long-term prediction (Fig. 7b), the $N = 15$ setting gives an average prediction error clearly lower than the other groups. Combining the long- and short-term results, we finally choose 15 as the size of $N$.

Table 7.

Ablation of N in fusion module. The best results are highlighted in bold, and the second best is marked by underline.

80 ms 160 ms 320 ms 400 ms 560 ms 720 ms 880 ms 1000 ms Short-average Long-average General average
Inline graphic 10.45 23.67 48.21 58.21 35.13
Inline graphic 10.12 21.43 46.64 57.18 77.03 91.03 101.99 110.37 33.84 95.10 64.47
Inline graphic 9.85 20.86 Inline graphic Inline graphic 75.86 89.85 101.21 109.09 33.65 94.00 63.82
Inline graphic 77.43 91.84 102.85 111.54 95.91
Inline graphic 79.26 93.91 105.57 115.17 98.47
Fig. 7.

(a). Experimental visualization of N ablation in short-term prediction. (b). Experimental visualization of N ablation in long-term prediction.

Conclusion

We propose a three-stage parallel prediction network that sets two intermediate targets on either side of the ground truth to guide the generation of the predictions. The key to the framework's effectiveness is that the parallel network structure reduces error accumulation and allows the prediction to approach the ground truth from both ends based on the two intermediate targets. In addition, we propose a new fusion block based on a GRU and an attention mechanism: the GRU handles short-term fusion, exploiting the advantages of RNNs for short-term prediction, while the attention mechanism weakens the error-accumulation characteristic of RNNs. Finally, we use the high-dimensional information produced by the encoders as the input of the fusion block to further avoid information loss. Extensive experiments and analyses demonstrate the effectiveness and advantages of our method. In the future, we will further explore how to make better use of intermediate targets to obtain a better prediction framework.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771322 and 62206178, the Stable Support Plan for Higher Education Institutions in Shenzhen (Project No. 20231121221536001), and the Fundamental Research Foundation of Shenzhen under Grant JCYJ20220531100814033.

Author contributions

J.Q.Z.: Investigation, Formal analysis, Software, Methodology; C.H.Y: Formal analysis, Software, Methodology; W.M.C: Supervision, Methodology. H.W.: Supervision, Methodology, Funding acquisition.

Data availability

All data generated or analyzed during this study are included in this paper. The data of this work is available on request from the authors. The data of H3.6M and CMU Mocap are also available in the public repository.

Competing interests

We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Jianqi Zhong and Conghui Ye.

References

  • 1. Gao, Z. et al. A pairwise attentive adversarial spatiotemporal network for cross-domain few-shot action recognition-r2. IEEE Trans. Image Process. 30, 767–782 (2020).
  • 2. Ge, S., Zhao, S., Gao, X. & Li, J. Fewer-shots and lower-resolutions: Towards ultrafast face recognition in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, 229–237 (2019).
  • 3. Gui, L.-Y. et al. Teaching robots to predict human motion. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 562–567 (IEEE, 2018).
  • 4. Liu, L. et al. Computing systems for autonomous driving: State of the art and challenges. IEEE Internet Things J. 8, 6469–6486 (2020).
  • 5. Huang, D.-A. & Kitani, K. M. Action-reaction: Forecasting the dynamics of human interaction. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13, 489–504 (Springer, 2014).
  • 6. Liu, Q., Liu, Z., Xiong, B., Xu, W. & Liu, Y. Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function. Adv. Eng. Inform. 49, 101360 (2021).
  • 7. Liu, R. & Liu, C. Human motion prediction using adaptable recurrent neural networks and inverse kinematics. IEEE Control Syst. Lett. 5, 1651–1656 (2020).
  • 8.Guo, X. & Choi, J. Human motion prediction via learning local structure representations and temporal dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence 33, 2580–2587 (2019).
  • 9.Cai, Y. et al. Learning progressive joint propagation for human motion prediction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, 226–242 (Springer, 2020).
  • 10.Fragkiadaki, K., Levine, S., Felsen, P. & Malik, J. Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision, 4346–4354 (2015).
  • 11.Martinez, J., Black, M. J. & Romero, J. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2891–2900 (2017).
  • 12.Gui, L.-Y., Wang, Y.-X., Liang, X. & Moura, J. M. Adversarial geometry-aware human motion prediction. In Proceedings of the european conference on computer vision (ECCV), 786–803 (2018).
  • 13.Li, C., Zhang, Z., Lee, W. S. & Lee, G. H. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5226–5234 (2018).
  • 14. Arjovsky, M. & Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
  • 15.Aksan, E., Kaufmann, M. & Hilliges, O. Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7144–7153 (2019).
  • 16.Cui, Q. & Sun, H. Towards accurate 3d human motion prediction from incomplete observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4801–4810 (2021).
  • 17.Dang, L., Nie, Y., Long, C., Zhang, Q. & Li, G. Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11467–11476 (2021).
  • 18.Ma, T., Nie, Y., Long, C., Zhang, Q. & Li, G. Progressively generating better initial guesses towards next stages for high-quality human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6437–6446 (2022).
  • 19. Tang, Y., Ma, L., Liu, W. & Zheng, W. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513 (2018).
  • 20.Liu, Z. et al. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10004–10012 (2019).
  • 21.Ghosh, P., Yao, Y., Davis, L. & Divakaran, A. Stacked spatio-temporal graph convolutional networks for action segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 576–585 (2020).
  • 22.Yan, S., Xiong, Y. & Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-second AAAI conference on artificial intelligence (2018).
  • 23. Rizkallah, M., Su, X., Maugey, T. & Guillemot, C. Geometry-aware graph transforms for light field compact representation. IEEE Trans. Image Process. 29, 602–616 (2019).
  • 24.You, J., Ying, R., Ren, X., Hamilton, W. & Leskovec, J. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, 5708–5717 (PMLR, 2018).
  • 25.Shi, L., Zhang, Y., Cheng, J. & Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12026–12035 (2019).
  • 26. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • 27.Li, M. et al. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 214–223 (2020).
  • 28. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • 29.Xiong, C., Merity, S. & Socher, R. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, 2397–2406 (PMLR, 2016).
  • 30.Aksan, E., Kaufmann, M., Cao, P. & Hilliges, O. A spatio-temporal transformer for 3d human motion prediction. In 2021 International Conference on 3D Vision (3DV), 565–574 (IEEE, 2021).
  • 31.Li, M. et al. Skeleton graph scattering networks for 3d skeleton-based human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 854–864 (2021).
  • 32.Zhou, T. et al. Sktformer: A skeleton transformer for long sequence data. In The Eleventh International Conference on Learning Representations (ICLR, 2023).
  • 33.Ibh, M., Grasshof, S., Witzner, D. & Madeleine, P. Tempose: a new skeleton-based transformer model designed for fine-grained motion recognition in badminton. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5199–5208 (2023).
  • 34.Mao, W., Liu, M., Salzmann, M. & Li, H. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 9489–9497 (2019).
  • 35. Fu, J., Yang, F., Dang, Y., Liu, X. & Yin, J. Learning constrained dynamic correlations in spatiotemporal graphs for motion prediction. IEEE Transactions on Neural Networks and Learning Systems (2023).
  • 36. Ionescu, C., Papava, D., Olaru, V. & Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1325–1339 (2013).
  • 37.Von Marcard, T., Henschel, R., Black, M. J., Rosenhahn, B. & Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European conference on computer vision (ECCV), 601–617 (2018).


