Scientific Reports. 2024 Oct 30;14:26058. doi: 10.1038/s41598-024-75782-7

Parallel multi-stage rectification networks for 3D skeleton-based motion prediction

Jianqi Zhong 1,2,#, Conghui Ye 1,2,#, Wenming Cao 1,2, Hao Wang 1,2
PMCID: PMC11522317  PMID: 39472613

Abstract

Recurrent Neural Networks (RNNs) are widely used in human motion prediction and have achieved promising performance, owing to their robust capacity for spatial-temporal sequence modeling. However, RNN-based methods suffer from error accumulation due to their step-by-step prediction mechanism. In this paper, we therefore propose a three-stage parallel prediction network that guides the outputs of its three sub-networks with different objectives. In particular, we fuse the high-dimensional information produced by the three sub-networks to generate the final output. We also design a fusion block based on a GRU and an attention mechanism to extract high-dimensional information more efficiently. Extensive experiments show that our approach outperforms most recent methods in both short-term and long-term motion prediction on Human3.6M, CMU-MoCap, and 3DPW.

Keywords: Recurrent neural network, Error accumulation, Motion prediction

Subject terms: Engineering, Electrical and electronic engineering, Mechanical engineering

Introduction

3D human motion prediction forecasts a person's future movements from their historical motion. It allows downstream systems to anticipate changes in motion in advance and has important applications in automatic motion generation1–3, autonomous driving4, and human-computer interaction5–7.

Because human motion sequences progress in chronological order, Recurrent Neural Networks (RNNs)8–11, recognized for their effectiveness in sequence prediction tasks, are a classic choice for this problem. However, RNN-based techniques tend to converge toward static poses in the medium to long term and often suffer from abrupt transitions and error accumulation, partly because of the difficulty of training RNNs. Some efforts to tackle this problem apply Generative Adversarial Networks (GANs)12,13 to the human motion prediction task, but such models are often difficult to train. Recent studies show that the attention mechanism in the Transformer can effectively capture long-term relationships within time series, strengthening the model's sensitivity to dynamic actions and yielding predictions that are closely aligned with the temporal context. The idea of attention14 is thus used to train the model's sensitivity to dynamic motion and to generate predictions with high temporal correlation: an attention-based model can compare the action in the last observable frame with each pose in the history sequence to relate the current action to past actions. Although attention models have recently excelled at long-term prediction, they may not match RNNs at capturing short-term local features because of the global nature of their attention.

Meanwhile, Graph Convolutional Networks (GCNs) have been widely used in various fields, including human motion prediction. A large body of work shows that GCNs are well suited to the Human Motion Prediction (HMP) problem15–17. These methods regard each joint of the human skeleton as a node of a graph and build edges between joints; a GCN then captures the spatial relationships between joints, which benefits pose prediction. Among GCN-based deep networks, Ma et al.18 adopt a "series progression" structure that uses intermediate targets to pull the predicted values closer to the ground truth, achieving strong results. However, this recursive structure shares a weakness with RNNs: the output of the previous stage, produced from an intermediate target, is used as the input of the next stage. Because the previous stage's output is not necessarily correct, the next stage's input is not the true value, which typically leads to error accumulation.

To address error accumulation, we abandon the recursive network structure and propose a three-stage parallel rectification network: starting from the ground truth, we apply two different operations to generate two different action sequences as intermediate targets, and then use these two intermediate targets for collaborative prediction. Our method can thus correct the predicted values from both ends and avoid error accumulation. Additionally, so as not to miss any information, we also use the encoders' outputs, i.e., the information in the high-dimensional space (hereinafter referred to as high-dimensional information), for fusion and correction. For better fusion, we propose a fusion block that merges the high-dimensional information of the three encoders. The fusion block includes a short-term fusion module based on a Gated Recurrent Unit (GRU) and a long-term information diffusion module based on the attention mechanism. Because the GRU, as a special type of RNN, has a recursive structure, it inevitably accumulates errors, which degrades long-term prediction. To mitigate this, we pair the GRU with an attention mechanism: the attention module takes the short-term high-dimensional features captured by the GRU and fuses them into long-term high-dimensional features, capturing the long-term dependencies of the sequence. In this way, the fusion block retains the GRU's advantage in capturing short-term features while the attention mechanism reduces the RNN's error accumulation and improves the ability to capture long-term features.

In short, the main contributions of this paper are as follows:

  • We propose a three-stage parallel prediction network that takes two action sequences derived from the ground truth as intermediate targets, on the basis of which the predicted values approach the ground truth from both ends. The two intermediate-target networks work together to correct the predicted values for higher accuracy.

  • We propose a fusion block based on the collaboration of a GRU and an attention mechanism. The attention mechanism alleviates the error accumulation of the GRU and effectively captures the global features of action sequences.

  • Extensive experiments show that our method outperforms previous approaches by large margins on three public datasets.

Related work

RNN-based methods (recurrent neural networks)

Human motion prediction based on the 3D skeleton has attracted significant attention in recent years and is an important research topic of practical importance. Among human motion prediction approaches, those based on deep learning are at the forefront. Typical approaches cast human motion as a sequence-to-sequence (seq2seq) learning problem10,11,19; in particular, RNNs have been proposed to capture temporal information about human motion. To alleviate the exploding- and vanishing-gradient problems of RNNs, researchers proposed the 3-layer long short-term memory (LSTM-3LR) and encoder-recurrent-decoder (ERD) models10, in which LSTMs are used to extract long-term correlations. These models are action-specific, and there is usually a significant discontinuity between the last observed frame and the first predicted frame. Researchers have also proposed various RNN variants, such as the hierarchical motion recurrent model20 and Verso-Time Label Noise-RNN21. Still, using an RNN as a framework for sequential computation unavoidably gives rise to the accumulation of errors.

GCN-based methods (graph convolutional networks)

Graphs effectively represent large amounts of non-grid-structured data and explicitly depict the correlations between vertices22,23. GCNs, as an extension of CNNs, are suitable for data with specific graph structures, e.g., social networks24 and 3D skeleton data21,25. In recent years, many scholars have investigated graph neural networks (GNNs), extending deep networks to the graph domain; GNNs use a hierarchical architecture and end-to-end training. Lebailly et al.13 use a GCN as an encoder and another GCN to decode the aggregated features. The Graph Convolutional Network (GCN)26 simplifies ChebyNet and combines spectral analysis with spatial operations; feature aggregation on the graph of a GCN is designed directly from the vertex perspective, similar to convolution on an image. Researchers have also proposed GNN variants such as DMGNN (Dynamic Multiscale Graph Neural Networks)27, which divides the human skeleton into three scales and lets the multiscale graph change dynamically across feature levels; by learning an integrated multi-scale feature representation, DMGNN achieves more accurate future motion prediction. Ma et al.18 built a spatiotemporal GCN model that obtains more accurate long-term predictions by predicting intermediate poses of human motion. We also build our encoders and decoders with GCN as the basic module.

Attention-based methods

Compared with RNNs, which suffer from error accumulation, the attention mechanism proposed by Bahdanau et al.28 captures long-term dependencies between input and output sequences and, to a certain extent, reduces error accumulation. Xiong et al.29 proposed an attention-based GRU model that uses memory networks to let information flow between sentences. Aksan et al.30 proposed spatial and temporal self-attention blocks that extract spatiotemporal features from motion sequences and then aggregate the most informative components to generate predictions. Li et al.31 combined the discrete cosine transform (DCT) with the attention mechanism: the human motion sequence is first transformed into the frequency domain through the DCT, the attention mechanism then evaluates feature importance in the frequency domain, and the result is finally converted back to the time domain for output. In addition, Zhou et al.32 proposed SKTformer, which reduces computational complexity when using the attention mechanism for long-sequence modeling. Ibh et al.33 proposed a novel skeleton-based transformer model called TemPose, which employs multiple temporal and interaction layers to effectively capture human behavior while minimizing dependence on non-human visual context, offering significant versatility in its applications.

In this paper, we take into account both the error accumulation of RNNs and the relatively weak ability of the attention mechanism to capture local features, and propose a fusion module in which the GRU and the attention mechanism cooperate to integrate short-term and long-term information. Our model outperforms most models in both short-term and long-term prediction on the Human3.6M, CMU-MoCap, and 3DPW datasets.

Methods

We denote the observed historical motion sequence as $X_{1:T_h} = [x_1, x_2, \dots, x_{T_h}]$, consisting of $T_h$ consecutive human poses, where $x_i$ denotes the pose at frame $i$, and $X_{T_h+1:T_h+T_f}$ denotes the future pose sequence of length $T_f$. Our goal is to predict the poses $X_{T_h+1:T_h+T_f}$ for the future $T_f$ time steps. Following18, we repeat the last observed pose $x_{T_h}$ $T_f$ times and append the copies to $X_{1:T_h}$, which yields the padded input sequence $\tilde{X}_{1:L}$ of length $L = T_h + T_f$. Accordingly, our goal becomes finding a mapping $\mathcal{F}$ from the padded sequence to its ground truth $X_{1:L}$; our model aims to learn a better $\mathcal{F}$.
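To make the padding step concrete, the sketch below repeats the last observed pose $T_f$ times and appends the copies to the history; the tensor layout (frames, joints, 3D coordinates) and the function name are illustrative assumptions rather than the authors' released code.

```python
import torch

def pad_input(history: torch.Tensor, t_future: int) -> torch.Tensor:
    """Build the padded input sequence by repeating the last observed pose.

    history: (T_h, M, 3) tensor of observed poses (M joints, 3D coordinates).
    Returns a (T_h + T_f, M, 3) tensor whose last T_f frames all equal history[-1].
    """
    last_pose = history[-1:].expand(t_future, -1, -1)  # (T_f, M, 3), repeated last frame
    return torch.cat([history, last_pose], dim=0)      # (L, M, 3) with L = T_h + T_f

# Example: 10 observed frames, 25 future frames, 22 joints
padded = pad_input(torch.randn(10, 22, 3), t_future=25)
assert padded.shape == (35, 22, 3)
```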

Here we propose Parallel Multi-stage Rectification Networks (PMRNet), a three-stage framework for motion prediction; Fig. 1 illustrates its architecture. Given the padded input $\tilde{X}_{1:L}$, we first pass it through three GCN-based encoders, En1, En2, and En3, each producing a set of latent features, denoted $h_{en1}$, $h_{en2}$, and $h_{en3}$, respectively. The three encoders share the same structure: residual blocks composed of several GCLs and a $1\times1$ convolution. The decoders De1 and De2 then use $h_{en1}$ and $h_{en2}$ to generate two outputs, $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$. The three decoders also share the same structure, consisting of several GCLs and a $1\times1$ convolution arranged as residual blocks, with the final result produced by an ST-GCN. The specific structure of the encoder and decoder is described in "Encoder-copy-decoder stage" below.

Fig. 1.

Overview of our parallel multi-stage human motion prediction framework, which contains three stages. Note that, from top to bottom, the figure shows Stage-1, Stage-3, and Stage-2, each containing the correspondingly numbered encoder and decoder. Each stage takes the padded sequence derived from the ground-truth sequence $X$ as input. Stage-3 is guided by the ground truth, while the remaining two stages are guided by sequences obtained from different operations on the ground truth. The encoder-decoder prediction networks used in the three stages are the same; the difference is that Stage-3 has one more high-dimensional information fusion block than Stage-1 and Stage-2. Please refer to the main text for more details.

To effectively integrate the information in $h_{en1}$, $h_{en2}$, and $h_{en3}$, we introduce a fusion block. It takes the three latent feature sets as input and outputs $h_{fuse}$, which encapsulates their combined information. The details of the fusion block are elaborated in "Fusion block based on GRU (gated recurrent unit) and attention mechanism". Finally, $h_{fuse}$ is fed into De3 to generate the output $\hat{X}^{(3)}$. The generation of $\hat{X}^{(3)}$ is directly guided by the ground truth, while the generation of $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$ is guided by sequences derived from the ground truth, as detailed in "Parallel multi-stage rectification networks".
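To make the data flow of Fig. 1 concrete, a schematic forward pass could be organized as below. The encoder, decoder, and fusion internals are described in the following subsections; the class and attribute names here are our own, and the snippet shows structure only, not the authors' implementation.

```python
import torch.nn as nn

class PMRNet(nn.Module):
    """Schematic three-stage forward pass (structure only)."""
    def __init__(self, encoders, decoders, fusion):
        super().__init__()
        self.en1, self.en2, self.en3 = encoders   # three GCN-based encoders (same structure)
        self.de1, self.de2, self.de3 = decoders   # three GCN-based decoders (same structure)
        self.fusion = fusion                      # GRU + attention fusion block (Stage-3 only)

    def forward(self, x_padded):
        h1, h2, h3 = self.en1(x_padded), self.en2(x_padded), self.en3(x_padded)
        out1 = self.de1(h1)                 # Stage-1 output, guided by one intermediate target
        out2 = self.de2(h2)                 # Stage-2 output, guided by the other intermediate target
        h_fused = self.fusion(h1, h2, h3)   # fuse high-dimensional information from all encoders
        out3 = self.de3(h_fused)            # Stage-3 output, guided by the ground truth
        return out1, out2, out3
```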

Parallel multi-stage rectification networks

To achieve the above objectives, we design the parallel multi-stage rectification prediction framework shown in Fig. 1. It consists of three stages; from top to bottom, Fig. 1 shows Stage-1, Stage-3, and Stage-2 (denoted $S_1$, $S_3$, and $S_2$, respectively). The input of all three stages is the padded input $\tilde{X}$, and the stages perform the following tasks respectively:

[Equation (1)]

in which the fusion operation stands for extracting the high-dimensional information of Stage-1 and Stage-2. We introduce this operation below.

Following18, we perform two different operations on the ground truth $X$ to obtain two intermediate target sequences, and use them as the targets of the corresponding stage networks $S_1$ and $S_2$ to guide the generation of $\hat{X}^{(1)}$ and $\hat{X}^{(2)}$, respectively. To guide the generation of $\hat{X}^{(3)}$, we use the ground truth itself. The two operations, Accumulated Average Smoothing (AAS) and the inverse operation of AAS (I-AAS), are introduced in the following (see Fig. 2).

Fig. 2.

AAS and I-AAS explained from a physical point of view. The gray skeleton is the ground truth, the red skeleton is the result of applying the I-AAS operation to the gray skeleton, and the green skeleton is the result of applying the AAS operation to it. On the right side of the figure, the arm is taken as an example: the I-AAS operation is equivalent to increasing the motion amplitude of the arm, while the AAS operation is equivalent to reducing it.

Suppose an action sequence $X$ has $M$ nodes in $D$-dimensional space, and each trajectory $p_j$ is composed of all coordinates of the same node across the action sequence; the action sequence thus contains one such trajectory per node. Since every trajectory is processed in the same way, the subscript $j$ is omitted in the following formulas.

As stated above, each trajectory contains two parts: the historical part $p_{1:T_h}$ and the future part $p_{T_h+1:L}$. We only process the future part and keep the historical part unchanged. The AAS algorithm is defined as

[Equation (2)]

In our understanding, on a physical level, the AAS operation is equivalent to reducing the motion range of the pose. Conversely, the inverse operation of AAS (I-AAS) increases the motion amplitude of the pose. The I-AAS algorithm is defined as:

[Equation (3)]

We apply the I-AAS and AAS operations to the ground truth $X$ to obtain the two intermediate targets, so that the different stages can capture action characteristics at different motion amplitudes.
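Because the exact forms of Eqs. (2) and (3) are not reproduced above, the sketch below shows one plausible realization that matches the verbal description: AAS replaces each future frame of a trajectory with the running (accumulated) average of the future frames seen so far, which damps the motion amplitude, and I-AAS is the corresponding inverse recurrence, which enlarges it. The function names, and the assumption that AAS is exactly this running average, are ours.

```python
import numpy as np

def aas(traj: np.ndarray, t_h: int) -> np.ndarray:
    """Accumulated average smoothing of one trajectory (a plausible form of Eq. (2)).

    traj: (L,) coordinate trajectory of a single joint axis.
    t_h:  number of historical frames, which are kept unchanged.
    Each future frame is replaced by the running mean of the future frames seen so far,
    which damps the motion amplitude of the future part.
    """
    out = traj.astype(float).copy()
    running_sum = 0.0
    for i, t in enumerate(range(t_h, len(traj)), start=1):
        running_sum += traj[t]
        out[t] = running_sum / i
    return out

def i_aas(traj: np.ndarray, t_h: int) -> np.ndarray:
    """Inverse of the accumulated average (a plausible form of Eq. (3)).

    Applying the inverse recurrence to the raw trajectory amplifies frame-to-frame
    changes in the future part, enlarging the motion amplitude.
    Sanity check: aas(i_aas(x, t_h), t_h) recovers x on the future frames.
    """
    out = traj.astype(float).copy()
    for i, t in enumerate(range(t_h, len(traj)), start=1):
        out[t] = i * traj[t] - (i - 1) * traj[t - 1]   # the i = 1 term reduces to traj[t]
    return out
```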

Encoder-copy-decoder stage

In this section, we introduce the network that fulfills the prediction task at each stage, an overview of which is illustrated in the upper middle of Fig. 1. Each stage is composed of an ST-GCN-based encoder and an ST-GCN-based decoder, with Stage-3 having an additional fusion block for integrating characteristic information of different motion amplitudes. In the following, we introduce the components one by one.

For pose prediction tasks, according to [20,21], GCNs can effectively extract the spatial features of a pose, and according to [22], GCNs can also extract its temporal features along the time dimension. In this paper, we use a spatio-temporal combination GCN (ST-GCN) to directly extract the spatio-temporal characteristics of each joint in the pose. An ST-GCN is used in the GCL of each encoder and decoder.

ST-GCN

Let $X \in \mathbb{R}^{L \times M \times F}$ be a pose sequence, where $L$ is the length of the sequence, $M$ is the number of joints of a pose, and $F$ indicates the number of features of a joint. We define two learnable adjacency matrices: a spatial adjacency $A^{S}$, which measures the relationships between the joint pairs of a pose, and a temporal adjacency $A^{T}$, which extracts the information of the joint trajectories. ST-GCN computes:

[Equation (4)]

where $W$ indicates the learnable parameters of ST-GCN, the transposition exchanges the first and second dimensions of its input, and the resulting tensor is the output of ST-GCN.
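Since Eq. (4) is not reproduced above, the following is a minimal PyTorch sketch of a layer with the described ingredients: a learnable $L \times L$ temporal adjacency, a learnable $M \times M$ spatial adjacency, and learnable feature weights. The module name, initialization, and einsum arrangement are our own assumptions.

```python
import torch
import torch.nn as nn

class STGCN(nn.Module):
    """A minimal spatio-temporal graph convolution layer (a sketch of Eq. (4)).

    Input/output shape: (batch, L, M, F_in) -> (batch, L, M, F_out), where L is the
    sequence length, M the number of joints, and F the per-joint feature size.
    """
    def __init__(self, length: int, joints: int, f_in: int, f_out: int):
        super().__init__()
        self.a_temporal = nn.Parameter(torch.eye(length))     # learnable L x L adjacency A^T
        self.a_spatial = nn.Parameter(torch.eye(joints))      # learnable M x M adjacency A^S
        self.weight = nn.Parameter(torch.empty(f_in, f_out))  # learnable feature map W
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Temporal aggregation: mix information across frames with A^T.
        x = torch.einsum("lt,btmf->blmf", self.a_temporal, x)
        # Spatial aggregation: mix information across joints with A^S.
        x = torch.einsum("mn,blnf->blmf", self.a_spatial, x)
        # Feature transform with the learnable weights W.
        return torch.matmul(x, self.weight)

# Example: padded length 35, 22 joints, 3 input coordinates, 64 output features
layer = STGCN(length=35, joints=22, f_in=3, f_out=64)
out = layer(torch.randn(8, 35, 22, 3))   # -> (8, 35, 22, 64)
```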

GCL

As shown at the top right of Fig. 1, a GCL passes its input sequentially through an ST-GCN, batch normalization, tanh, and dropout. The GCL is used to extract the global spatio-temporal features of the pose sequence.

Encoder

As shown at the top of Fig. 1, the encoder is a residual block composed of a $1\times1$ convolution layer and multiple GCLs. The first GCL maps the input $X$ from the pose space $\mathbb{R}^{L \times M \times 3}$ to the feature space $\mathbb{R}^{L \times M \times F}$, where $F$ is a fixed hyperparameter in this paper. In addition, we use a $1\times1$ convolution layer with $F$ kernels to map the input $X$ to the same feature space and add the result to the output of the last GCL as the residual connection of the encoder.

Copy

The feature map of size $L \times M \times F$ produced by the encoder is copied along the time dimension to obtain a feature map of size $2L \times M \times F$, which is used as the input of the decoder; the encoder features are also passed to the fusion block. According to18, the "copy" operation increases the size of the feature space and improves the predictive performance of the model.

Decoder

The decoder is a residual block consisting of multiple GCLs, an ST-GCN, and a $1\times1$ convolution layer. The GCLs extract temporal and spatial information from the feature space of $\mathbb{R}^{2L \times M \times F}$; since the "copy" operation doubles the time dimension of the decoder's hidden layer, the temporal adjacency matrix $A^{T}$ in the decoder's GCLs has size $2L \times 2L$. Finally, the high-dimensional features are projected back into the pose space through the final ST-GCN. At the same time, we apply a $1\times1$ convolution with 3 kernels to the decoder's input to obtain its residual connection to the decoder's output. The decoder outputs a sequence of length $2L$, and we only keep the first $L$ poses as the final result.
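Putting the pieces together, the sketch below wires up a GCL (ST-GCN, batch normalization, tanh, dropout), the residual encoder, the "copy" operation, and the decoder whose output is truncated to the first $L$ poses. Layer counts follow the implementation details reported later (2 GCLs per encoder, 4 per decoder); the compact ST-GCN stand-in, the feature width default of 64, the dropout rate, and the use of per-feature linear layers in place of $1\times1$ convolutions are our own assumptions.

```python
import torch
import torch.nn as nn

class STGCN(nn.Module):
    """Compact stand-in for the spatio-temporal graph convolution sketched earlier."""
    def __init__(self, length, joints, f_in, f_out):
        super().__init__()
        self.a_t = nn.Parameter(torch.eye(length))
        self.a_s = nn.Parameter(torch.eye(joints))
        self.w = nn.Parameter(torch.randn(f_in, f_out) * 0.01)

    def forward(self, x):                                   # x: (B, L, M, F_in)
        x = torch.einsum("lt,btmf->blmf", self.a_t, x)      # temporal aggregation
        x = torch.einsum("mn,blnf->blmf", self.a_s, x)      # spatial aggregation
        return torch.matmul(x, self.w)

class GCL(nn.Module):
    """ST-GCN -> batch norm -> tanh -> dropout, as described in the GCL paragraph."""
    def __init__(self, length, joints, f_in, f_out, p_drop=0.1):
        super().__init__()
        self.gcn = STGCN(length, joints, f_in, f_out)
        self.bn = nn.BatchNorm2d(f_out)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        x = self.gcn(x)                                     # (B, L, M, F_out)
        x = self.bn(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return self.drop(torch.tanh(x))

class EncoderCopyDecoder(nn.Module):
    """One stage: residual GCL encoder, copy along time, residual GCL decoder."""
    def __init__(self, length, joints, feat=64):
        super().__init__()
        self.enc = nn.Sequential(GCL(length, joints, 3, feat), GCL(length, joints, feat, feat))
        self.enc_res = nn.Linear(3, feat)                   # 1x1-conv-style residual projection
        self.dec = nn.Sequential(*[GCL(2 * length, joints, feat, feat) for _ in range(4)])
        self.dec_out = STGCN(2 * length, joints, feat, 3)   # project back to the pose space
        self.dec_res = nn.Linear(feat, 3)                   # residual path with 3 "kernels"
        self.length = length

    def forward(self, x):                                   # x: (B, L, M, 3), padded input
        h = self.enc(x) + self.enc_res(x)                   # encoder with residual connection
        h2 = torch.cat([h, h], dim=1)                       # "copy": double the time dimension
        y = self.dec_out(self.dec(h2)) + self.dec_res(h2)   # decoder with residual connection
        return h, y[:, : self.length]                       # keep only the first L poses
```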

Fusion block based on GRU (gated recurrent unit) and attention mechanism

To fuse high-dimensional information across the stages, we propose a fusion block based on a GRU and an attention mechanism. Its function is to fuse and refine the high-dimensional features extracted by the three encoders, and its output is used as the input of decoder-3 after the copy operation. For convenience, the outputs of encoder-1, encoder-2, and encoder-3 are denoted $h_{en1}$, $h_{en2}$, and $h_{en3}$, respectively; they all have size $L \times M \times F$.

As shown in Fig. 3, the fusion of the three sets of high-dimensional information is divided into two steps. The first step extracts the first $N$ frames $h_{en3}^{1:N}$ of $h_{en3}$ and inputs them into the GRU together with $h_{en1}$ and $h_{en2}$ to obtain the output $h_{gru}$; $h_{gru}$ then overwrites the first $N$ frames of $h_{en3}$. The second step inputs the high-dimensional information integrated in the first step into the attention-based diffusion block to obtain new high-dimensional information, and the final output contains the high-dimensional information of encoders 1-3. Next, we introduce these two steps in detail.

Fig. 3.

Overview of the fusion block for aggregating the high-dimensional information of the three stages. The GRU module generates the short-term high-dimensional information, and the attention module generates the remaining high-dimensional information in one pass to reduce the error accumulation introduced by the GRU module. Finally, the outputs of the two modules are spliced along the time dimension and fed to the decoder of Stage-3.

Short-term information fusion based on GRU

Before the high-dimensional information is input into the GRU, $h_{en1}$ and $h_{en2}$ are first integrated through a linear layer and a pooling layer:

[Equation (5)]

where $W_1$ and $W_2$ are trainable linear mappings and $\mathrm{AvgPool}(\cdot)$ is a temporal average pooling layer; the mapped $h_{en1}$ and $h_{en2}$ are spliced along the spatial dimension, and the result of Eq. (5) is denoted $h_{12}$.

GCN-based GRU (G-GRU). Details of the module are shown in the GRU part at the top left of Fig. 3. The function of the G-GRU is to learn and update hidden states under the guidance of the high-dimensional information in $h_{12}$ and $h_{en3}^{1:N}$. At step $t$ ($t = 1, \dots, N$), the G-GRU takes two inputs, the hidden state from the previous step (the initial state when $t = 1$) and frame $t$ of $h_{en3}^{1:N}$, and works as

[Equation (6)]

where the gate weights are trainable linear mappings and ST-GCN is the module introduced in "ST-GCN" above.

Finally, we splice all the outputs of the G-GRU along the time dimension:

[Equation (7)]

where the spliced result is denoted $h_{gru}$ and $N$ is a hyperparameter. Each G-GRU cell applies an ST-GCN to the hidden states for information propagation and generates the high-dimensional information for the next frame.
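Since Eq. (6) is not reproduced above, the following is only a structural sketch of the short-term fusion step: the fused $h_{12}$ features initialize the hidden state, each of the first $N$ frames of $h_{en3}$ is fed to a GRU cell in turn, and a learnable joint adjacency mixes the hidden state between steps as a stand-in for the ST-GCN propagation. The shapes, the flattening of joints and features, and the initialization choice are our assumptions.

```python
import torch
import torch.nn as nn

class GGRUFusion(nn.Module):
    """Sketch of the short-term fusion: a GRU cell whose hidden state is mixed by a
    learnable joint adjacency after every step (stand-in for the ST-GCN propagation)."""
    def __init__(self, joints: int, feat: int):
        super().__init__()
        self.cell = nn.GRUCell(input_size=joints * feat, hidden_size=joints * feat)
        self.adj = nn.Parameter(torch.eye(joints))   # learnable joint adjacency (GCN-style)
        self.joints, self.feat = joints, feat

    def forward(self, h12: torch.Tensor, h_en3_first_n: torch.Tensor) -> torch.Tensor:
        """h12: (B, M, F) features fused from encoders 1 and 2 (used as the initial state).
        h_en3_first_n: (B, N, M, F) first N frames of the encoder-3 features.
        Returns (B, N, M, F): one fused frame per step, spliced along time as in Eq. (7)."""
        b, n, m, f = h_en3_first_n.shape
        hidden = h12.reshape(b, m * f)
        outputs = []
        for t in range(n):
            frame = h_en3_first_n[:, t].reshape(b, m * f)
            hidden = self.cell(frame, hidden)                        # GRU update for frame t
            mixed = torch.einsum("ij,bjf->bif", self.adj, hidden.reshape(b, m, f))
            hidden = mixed.reshape(b, m * f)                         # graph-mixed hidden state
            outputs.append(mixed)
        return torch.stack(outputs, dim=1)
```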

Long-term information diffusion based on the attention mechanism

To avoid the error-accumulation characteristic of RNNs and to diffuse the high-dimensional information of the $N$ frames fused by the GRU to the remaining $T-N$ frames (where $T$ is the temporal length of the high-dimensional features), we propose a diffusion block based on the attention mechanism that generates the high-dimensional information of the last $T-N$ frames in a single pass. Finally, we splice the $N$ frames of high-dimensional information extracted by the GRU and the $T-N$ frames obtained by attention diffusion along the time dimension as the fused high-dimensional information.

As shown in the Attention part on the right of Fig. 3, queries and keys are used to calculate attention scores, which then weight the corresponding values. To do this, we first map the query and key to a vector space of dimension $d$ through two functions, $f_q(\cdot)$ and $f_k(\cdot)$, modeled with neural networks. This can be expressed as:

[Equation (8)]

The input of $f_q$ is the output $h_{gru}$ of the GRU module, and the input of $f_k$ is the last $T-N$ frames of the encoder-3 output $h_{en3}$. Then, the attention scores of $q$ and $k$ are:

[Equation (9)]

Finally, we multiply the attention scores by the last $T-N$ frames of $h_{en3}$, passed through a fully connected layer, to obtain $V$:

[Equation (10)]

where $V$ is the output of the attention module. The final input to decoder-3 is the splice of $h_{gru}$ and $V$ along the time dimension.
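Equations (8)-(10) are likewise not reproduced above, so the sketch below shows one plausible arrangement of the diffusion step: each of the last $T-N$ frames gathers information from the $N$ GRU-fused frames through scaled dot-product attention in a single pass, with a residual connection to the original frame features. Note that this sketch pools the query/key mappings over joints and treats the tail frames as the queries, whereas the text above maps $f_q$ to $h_{gru}$ and $f_k$ to the tail frames; the exact role assignment, the softmax normalization, and the residual are our assumptions.

```python
import math
import torch
import torch.nn as nn

class AttentionDiffusion(nn.Module):
    """Sketch of the long-term diffusion: the last T-N frames gather information from
    the N GRU-fused frames in one pass (no recurrence, hence no stepwise error buildup)."""
    def __init__(self, feat: int, dim: int = 64):
        super().__init__()
        self.f_q = nn.Linear(feat, dim)    # query mapping
        self.f_k = nn.Linear(feat, dim)    # key mapping
        self.f_v = nn.Linear(feat, feat)   # value mapping (fully connected layer)
        self.dim = dim

    def forward(self, h_gru: torch.Tensor, h_en3_tail: torch.Tensor) -> torch.Tensor:
        """h_gru: (B, N, M, F) fused frames; h_en3_tail: (B, T-N, M, F) remaining frames.
        Returns (B, T-N, M, F): diffused features for the remaining frames."""
        q = self.f_q(h_en3_tail).mean(dim=2)                   # (B, T-N, dim), pooled over joints
        k = self.f_k(h_gru).mean(dim=2)                        # (B, N, dim)
        scores = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.dim), dim=-1)  # (B, T-N, N)
        v = self.f_v(h_gru)                                    # (B, N, M, F)
        out = torch.einsum("btn,bnmf->btmf", scores, v)        # weighted sum over the N frames
        return out + h_en3_tail                                # residual keeps frame-wise content
```

The final decoder-3 input would then be the temporal splice torch.cat([h_gru, out], dim=1), matching the description above.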

Experiments

Datasets

Human3.6M Human3.6M is a large public dataset for 3D human pose estimation research36. It contains 3.6 million 3D human poses and corresponding images, with 7 subjects (S1, S5, S6, S7, S8, S9, and S11) and 17 action scenarios, such as discussion, eating, movement, and greeting. Each pose has 32 joints in exponential-map format. Similar to18, we convert them to 3D coordinate representations and discard 10 redundant joints. Global rotations and translations of the poses are excluded. S5 and S11 are used for testing and validation, respectively, while the rest are used for training.

CMU-MoCap The CMU-MoCap dataset consists of 5 categories of general movements: "human interaction", "interaction with the environment", "movement", "physical activity movement", and "situations scenarios"36. There are 8 representative action categories commonly used in CMU-MoCap, including jumping, backing, running, sitting down, standing up, pickup, basketball, and cartwheels. Each pose contains 38 joints in exponential-map format, which are also translated into 3D coordinate representations; global rotations and translations of the poses are excluded. As per17,34, we keep 25 joints and discard the others. The division into training and test sets is also the same as in17,34.

3DPW 3DPW37 is a challenging dataset containing human movements captured in indoor and outdoor scenes. The poses in this dataset are represented in 3D space. Each pose contains 26 joints, 23 of which are used (the other 3 are redundant).

Comparison settings

Evaluation metrics

Mean Per Joint Position Error (MPJPE) in millimeters is the most widely used evaluation metric. Assume that the predicted pose sequence is $\hat{X}_{1:T}$ and the corresponding ground truth is $X_{1:T}$. Then the MPJPE loss is

$$L_{\mathrm{MPJPE}} = \frac{1}{M\,T}\sum_{t=1}^{T}\sum_{j=1}^{M}\left\lVert \hat{p}_{t,j} - p_{t,j} \right\rVert_{2} \qquad (11)$$

where $\hat{p}_{t,j}$ represents the predicted position of the $j$-th joint in frame $t$, and $p_{t,j}$ is the corresponding ground-truth position.
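For reference, MPJPE can be computed as below; the mean is taken over joints, frames, and batch, following the standard definition of the metric. The tensor layout is an assumption for illustration.

```python
import torch

def mpjpe(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error in the same units as the inputs (here: millimeters).

    pred, target: (..., T, M, 3) predicted and ground-truth joint positions.
    Returns the per-joint L2 distance averaged over joints, frames, and any batch dims.
    """
    return torch.linalg.norm(pred - target, dim=-1).mean()

# Example: batch of 8 sequences, 25 future frames, 22 joints
loss = mpjpe(torch.randn(8, 25, 22, 3), torch.randn(8, 25, 22, 3))
```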

Lengths of input and output sequences

Human3.6M and CMU-MoCap use an input length of 10 frames and an output length of 25 frames; 3DPW uses an input length of 10 and an output length of 30.

Implementation details

There are 3 stages in our entire network. Each encoder contains 2 GCLs and each decoder contains 4 GCLs, so the framework contains 18 GCLs in total. N in the fusion block is set to 15 for long-term prediction and 10 for short-term prediction. We employ Adam as the solver; the learning rate is initially 0.005 and is multiplied by 0.96 after each epoch. The model is trained for 50 epochs with a batch size of 32. The devices we used are an NVIDIA GeForce RTX 2060 GPU and an Intel(R) Core(TM) i7-10700 CPU.
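A minimal optimization setup matching the reported hyperparameters (Adam, initial learning rate 0.005 multiplied by 0.96 after each epoch, 50 epochs, batch size 32) might look as follows. `model` and `train_loader` are placeholders, and only the final-stage MPJPE term is shown; the full method additionally supervises the Stage-1 and Stage-2 outputs with the two intermediate targets.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

def train(model, train_loader, device="cuda", epochs=50):
    """Optimization setup as reported: Adam, lr 0.005, x0.96 decay per epoch."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
    scheduler = ExponentialLR(optimizer, gamma=0.96)
    for epoch in range(epochs):
        for padded_input, target in train_loader:           # batches of size 32
            padded_input, target = padded_input.to(device), target.to(device)
            pred = model(padded_input)                       # final Stage-3 prediction
            loss = torch.linalg.norm(pred - target, dim=-1).mean()   # MPJPE loss, Eq. (11)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                                     # multiply lr by 0.96 after each epoch
```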

Comparisons with previous approaches

We compare our method with Res. Sup.11, DMGNN27, LTD34, MSR17, PGBIG18, and DS-GCN35 on the three datasets. Res. Sup. is an early RNN-based approach. DMGNN uses a GCN for feature extraction and the RNN variant GRU for decoding. LTD relies entirely on GCNs and performs prediction in the frequency domain. MSR is a recent method that applies the LTD idea in a multi-scale manner. PGBIG builds a GCN-based network with intermediate targets and is currently the most advanced method of this kind. DS-GCN is one of the latest GCN-based methods. All of these methods are previous state-of-the-art approaches with publicly released code.

Human3.6M. Table 1 shows quantitative comparisons of short-term prediction (up to 400 ms) on Human3.6M between our method and the approaches above, and Table 2 shows comparisons of long-term prediction (from 560 ms to 1000 ms). In most cases, our results are better than those of the compared methods. In Fig. 4a, we compare the approaches statistically: taking PGBIG as the baseline, we subtract the prediction errors of MSR, LTD, and our method from those of PGBIG and plot the relative average prediction errors at every future timestamp. As can be seen, MSR and LTD are considerably less effective than PGBIG, and there is little difference between them. Our method outperforms PGBIG at all prediction timestamps, with the most significant advantage at 160 ms.

Table 1.

Comparisons of short-term prediction on Human3.6M. Results at 80 ms, 160 ms, 320 ms, and 400 ms in the future are shown.

Scenarios Walking Eating Smoking Discussion
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup.11 29.4 50.8 76.0 81.5 16.8 30.6 56.9 68.7 23.0 42.6 70.1 82.7 32.9 61.2 90.9 96.2
DMGNN27 17.3 30.7 54.6 65.2 11.0 21.4 36.2 43.9 9.0 17.6 32.1 40.3 17.3 34.8 61.0 69.8
LTD34 12.3 23.0 39.8 46.1 8.4 16.9 33.2 40.7 7.9 16.2 31.9 38.9 12.5 27.4 58.5 71.7
MSR17 12.2 22.7 38.6 45.2 8.4 17.1 33.0 40.4 8.0 16.3 31.3 38.2 12.0 26.8 57.1 69.7
PGBIG18 Inline graphic Inline graphic Inline graphic Inline graphic 7.0 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 10.0 Inline graphic Inline graphic Inline graphic
DS-GCN35 11.0 22.3 38.8 45.1 Inline graphic 15.5 31.7 39.1 Inline graphic 14.7 29.7 36.6 Inline graphic 24.3 54.5 67.4
Ours 9.4 17.7 32.8 38.7 6.8 13.1 30.2 36.7 6.3 12.1 27.8 33.6 10.0 22.1 52.7 64.5
Scenarios Directions Greeting Phoning Posing
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 35.4 57.3 76.3 87.7 34.5 63.4 124.6 142.5 38.0 69.3 115.0 126.7 36.1 69.1 130.5 157.1
DMGNN 13.1 24.6 64.7 81.9 23.3 50.3 107.3 132.1 12.5 25.8 48.1 58.3 15.3 29.3 71.5 96.7
LTD 9.0 19.9 43.4 53.7 18.7 38.7 77.7 93.4 10.2 21.0 42.5 52.3 13.7 29.9 66.6 84.1
MSR 8.6 19.7 43.3 53.8 16.5 37.0 77.3 93.4 10.1 20.7 41.5 51.3 12.8 29.4 67.0 85.0
PGBIG 7.2 17.6 Inline graphic Inline graphic 15.2 34.1 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 10.7 25.7 Inline graphic Inline graphic
DS-GCN 6.8 Inline graphic Inline graphic 51.6 14.2 Inline graphic 72.1 87.3 8.5 19.2 40.3 49.8 Inline graphic Inline graphic 60.6 77.3
Ours Inline graphic 14.7 39.8 50.9 Inline graphic 31.2 69.7 85.7 8.2 17.4 38.5 47.6 9.1 23.6 58.7 74.8
Scenarios Purchases Sitting Sittingdown Takingphoto
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 36.3 60.3 86.5 95.9 42.6 81.4 134.7 151.8 47.3 86.0 145.8 168.9 26.1 47.6 81.4 94.7
DMGNN 21.4 38.7 75.7 92.7 11.9 25.1 44.6 50.2 15.0 32.9 77.1 93.0 13.6 29.0 46.0 58.8
LTD 15.6 32.8 65.7 79.3 10.6 21.9 46.3 57.9 16.1 31.1 61.5 75.5 9.9 20.9 45.0 56.6
MSR 14.8 32.4 66.1 79.6 10.5 22.0 46.3 57.8 16.1 31.6 62.5 76.8 9.9 21.0 44.6 56.3
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic 8.8 Inline graphic 42.4 53.8 Inline graphic Inline graphic 57.4 71.5 Inline graphic 18.9 Inline graphic Inline graphic
DS-GCN 12.6 29.6 62.2 75.9 Inline graphic 19.3 42.8 54.3 14.1 28.0 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 53.5
Ours 11.5 26.6 59.6 72.5 8.4 17.9 Inline graphic Inline graphic 13.0 25.5 56.2 69.8 8.3 17.8 41.5 52.5
Scenarios Waiting Walkingdog Walkingtogether Average
Millisecond 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms 80 ms 160 ms 320 ms 400 ms
Res. Sup. 30.6 57.8 106.2 121.5 64.2 102.1 141.1 164.4 26.8 50.1 80.2 92.2 34.7 62.0 101.1 115.5
DMGNN 12.2 24.2 59.6 77.5 47.1 93.3 160.1 171.2 14.3 26.7 50.1 63.2 17.0 33.6 65.9 79.7
LTD 11.4 24.0 50.1 61.5 23.4 46.2 83.5 96.0 10.5 21.0 38.5 45.2 12.7 26.1 52.3 63.5
MSR 10.7 23.1 48.3 59.2 20.7 42.9 80.4 93.3 10.6 20.9 37.4 43.9 12.1 25.6 51.6 62.9
PGBIG 8.9 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 8.7 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
DS-GCN Inline graphic Inline graphic 44.2 55.2 19.6 41.8 77.6 90.2 9.0 19.7 36.3 42.6 Inline graphic 23.3 48.7 59.8
Ours 8.0 18.7 42.9 52.9 17.7 36.4 71.9 84.5 Inline graphic 17.5 34.0 40.4 9.8 20.8 46.6 57.2

The best results are highlighted in bold, and the second best is marked by underline.

Table 2.

Comparisons of long-term prediction on Human3.6M. Results at 560 ms and 1000 ms in the future are shown. The best results are highlighted in bold, and the second best is marked by underline.

Scenarios Walking Eating Smoking Discussion Directions Greeting Phoning Posing
Millisecond 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms
Res. Sup. 81.7 100.7 79.9 100.2 94.8 137.4 121.3 161.7 110.1 152.5 156.1 166.5 141.2 131.5 194.7 240.2
DMGNN 73.4 95.8 58.1 86.7 50.9 72.2 81.9 138.3 110.1 115.8 152.5 157.7 78.9 98.6 163.9 310.1
LTD 54.1 59.8 53.4 77.8 50.7 72.6 91.6 121.5 71.0 101.8 115.4 148.8 69.2 103.1 114.5 173.0
MSR 52.7 63.0 52.5 77.1 49.5 71.6 88.6 117.6 71.2 100.6 116.3 147.2 68.3 104.4 116.3 174.3
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic 118.2 69.3 100.4 110.2 Inline graphic Inline graphic 102.7 Inline graphic 164.8
DS-GCN 52.7 59.7 51.8 76.1 48.0 71.1 87.0 116.3 Inline graphic Inline graphic 108.6 142.2 66.6 102.2 106.5 163.3
Ours 47.6 55.3 50.3 74.9 45.1 68.2 86.5 Inline graphic 68.7 98.9 Inline graphic 142.2 64.1 Inline graphic 105.1 Inline graphic
Scenarios Purchases Sitting Sitting down Taking photo Waiting Walking dog Walking together Average
Millisecond 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms 560 ms 1000 ms
Res. Sup. 122.7 160.3 167.4 201.5 205.3 277.6 117.0 143.2 146.2 196.2 191.3 209.0 107.6 131.1 97.6 130.5
DMGNN 118.6 153.8 60.1 104.9 122.1 168.8 91.6 120.7 106.0 136.7 194.0 182.3 83.4 115.9 103.0 137.2
LTD 102.0 143.5 78.3 119.7 100.0 150.2 77.4 119.8 79.4 108.1 111.9 148.9 55.0 65.6 81.6 114.3
MSR 101.6 139.2 78.2 120.0 102.8 155.5 77.9 121.9 76.3 106.3 111.9 148.2 52.9 65.9 81.1 114.2
PGBIG Inline graphic Inline graphic 74.4 116.1 96.7 147.8 Inline graphic 118.6 Inline graphic Inline graphic Inline graphic Inline graphic 51.9 64.3 Inline graphic Inline graphic
DS-GCN 97.5 137.7 74.9 117.7 Inline graphic Inline graphic 74.5 Inline graphic 73.1 105.6 109.8 147.6 Inline graphic 61.2 77.8 111.0
Ours 94.5 131.2 Inline graphic Inline graphic 95.8 146.3 72.3 117.7 71.7 102.5 102.7 138.3 50.1 Inline graphic 75.8 109.0

Fig. 4.

(a) Advantage analysis (Human3.6M). The advantage of our method is most significant at 160 ms. (b) Advantage analysis (CMU-MoCap). The advantage of our method is most significant at 560 ms. (c) Advantage analysis (3DPW). The advantage of our method is most significant at 400 ms.

CMU-MoCap. We report the average MPJPE for short-term prediction (up to 400 ms) and long-term prediction (560 ms to 1000 ms) in Table 3. In Fig. 4b, we again take PGBIG as the baseline and subtract the prediction errors of MSR, LTD, and our method from those of PGBIG; the relative average prediction errors at every future timestamp are plotted. Our approach still outperforms PGBIG on this dataset, with the most significant advantage at 560 ms.

Table 3.

CMU-MoCap: comparisons of average prediction errors.

Millisecond 80 ms 160 ms 320 ms 400 ms 560 ms 1000 ms
Res. Sup. 24.0 43.0 74.5 87.2 105.5 136.3
DMGNN 13.6 24.1 47.0 58.8 77.4 112.6
LTD 9.3 17.1 33.0 40.9 55.8 86.2
MSR 8.1 15.2 30.6 38.6 53.7 83.0
PGBIG 7.6 14.3 29.0 36.6 53.7 80.1
DS-GCN Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Ours 7.2 13.5 27.4 35.2 49.7 79.7

3DPW. We report the average MPJPE for short-term prediction (up to 400 ms) and long-term prediction (600 ms to 1000 ms) in Table 4. In Fig. 4c, we take PGBIG as the baseline and subtract the prediction errors of MSR, LTD, and our method from those of PGBIG. Our approach works very well on this challenging dataset, especially for short-term prediction, where our advantage is significant: at 400 ms, the error of our method is 10.4 lower than that of PGBIG.

Table 4.

3DPW: comparisons of average prediction errors.

Millisecond 200 ms 400 ms 600 ms 800 ms 1000 ms
Res. Sup. 113.9 173.1 191.9 201.1 210.7
DMGNN 37.3 67.8 94.5 109.7 123.6
LTD 35.6 67.8 90.6 106.9 117.8
MSR 37.8 71.3 93.9 110.8 121.5
PGBIG Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Ours 22.9 47.9 75.9 92.8 102.4

Model complexity analysis

As shown in Table 5, our model is smaller than most mainstream methods; however, due to the added fusion block, it is larger than PGBIG. Since our network has three stages and requires two additional losses to be computed, it is slightly slower than LTD and PGBIG. If the fusion block uses only the attention mechanism, the running speed improves but the model size slightly increases; if it uses only the GRU, the running speed decreases but the model size also decreases.

Table 5.

Time cost and model size comparisons.

Method Train (Per batch) (ms) Test (Per batch) (ms) Model size (M)
DMGNN 473 85 46.90
LTD 114 30 2.55
Res. Sup. 191 57 6.30
PGBIG 145 43 1.74
TemPose 3.8
PMRNet-w/o Att 250 156 1.48
PMRNet-w/o GRU 117 41 2.19
PMRNet(w/GRU+ATT) 157 52 2.12

Ablation studies

We conducted ablation experiments on the Human3.6M dataset to further explore and analyze our method.

Architecture

Several design choices help to explore the effectiveness of our approach: (1) the multi-stage learning framework, (2) the choice of intermediate targets, and (3) the composition of the fusion block. Table 6 shows ablation experiments on different variants of the full model. The full model has three stages guided by different targets and a fusion block containing both the GRU and the attention mechanism; its average prediction error is 63.82. (1) To show the effectiveness of the three stages, we test the cases of only one stage and of only two stages. When only Stage-3 is used but with 18 GCLs (i.e., 6 GCLs for the encoder and 12 GCLs for the decoder), the average prediction error rises sharply to 67.99. When Stage-1 or Stage-2 is removed (denoted "w/o Stage-1" and "w/o Stage-2", respectively), i.e., Stage-3 and the remaining stage each have 3 GCLs for the encoder and 6 GCLs for the decoder, the average prediction errors are 73.09 and 73.85, respectively. Thus, the three stages have a significant effect on prediction quality. (2) In the next experiment, we use the ground truth (GT) to guide all the intermediate outputs, which yields an average prediction error of 65.38. (3) Finally, we remove the GRU (denoted "w/o GRU") and the attention mechanism (denoted "w/o ATT") from the fusion block, respectively; the resulting average prediction errors are 68.45 and 67.04. Figure 6 visualizes these architecture ablations. It can be seen from Fig. 6 that removing a single stage has the biggest impact on the model. When the attention mechanism is missing, the model's predictions become better at short horizons but worse at long horizons, indicating that the GRU plays a large role in short-term prediction but also suffers from error accumulation. In the absence of the GRU module, the predictions become worse at all horizons, indicating that the attention mechanism has a certain similarity with the GRU. When all stage outputs are guided directly by the GT, performance also deteriorates, indicating that generating intermediate targets is necessary.

Table 6.

Ablations on architecture. The best results are highlighted in bold, and the second best is marked by underline.

80 ms 160 ms 320 ms 400 ms 560 ms 720 ms 880 ms 1000 ms Average
Only stage-3 11.97 25.17 50.3 61.24 80.5 94.57 106.23 113.98 67.99
w/o Stage-2 15.92 31.85 65.21 72.49 83.24 96.46 108.73 116.97 73.85
w/o Stage-1 15.54 31.33 65.19 72.49 82.73 94.94 107.52 115.02 73.09
Guided by GT at all stages 10.88 22.23 48.74 59.97 76.43 91.04 102.52 111.27 65.38
w/o GRU in fusion block 14.93 24.78 54.38 64.65 80.51 93.53 103.59 111.27 68.45
w/o ATT in fusion block 9.43 20.17 45.47 58.12 79.92 96.22 109.57 117.43 67.04
PMRNet (w/GRU+ATT) Inline graphic Inline graphic Inline graphic 57.28 75.86 89.85 101.21 109.09 63.82
Fig. 6.

Visualization of ablation experiments on architecture. The full model is the baseline. We subtract the prediction errors of the full model from those of the compared models.

Qualitative visualization

To put a finer point on the evaluation of our method, we present qualitative results by visualizing the predicted motion sequences (see Fig. 5). We provide examples of the actions "basketball-signal" and "directing-traffic" from CMU-MoCap. We compare the model in which the GRU and the attention mechanism are fused cooperatively with the fusion-block variants without the GRU (w/o GRU) and without the attention mechanism (w/o Att), respectively. As can be seen from the frames highlighted in red boxes, using both the GRU and the attention mechanism allows the fusion block to generate more realistic body movements.

Fig. 5.

Visualization of predicted poses on two samples of CMU Mocap.

Size of N in the fusion block

The high-dimensional information from encoder-1 and encoder-2, after being fused by the linear and pooling layers, is fed into the GRU together with the first $N$ frames of the high-dimensional information from encoder-3. When $N$ frames are input to the GRU, the remaining $T-N$ frames are input to the subsequent attention diffusion layer, where $T$ is the temporal length of the high-dimensional information. For short-term prediction (10 frames input, 10 frames output, so the high-dimensional information has a time dimension of 20 frames), the range of $N$ is 0-20. When $N = 0$, only the attention mechanism is used without the GRU, and when $N$ equals the full time dimension, the opposite is true; these two cases correspond to the architecture ablations above. We therefore test three intermediate settings of $N$ for short-term prediction. Similarly, for long-term prediction (10 frames input, 25 frames output, so the time dimension of the high-dimensional information is 35 frames), we test four settings of $N$. The experimental results are given in Table 7 and visualized in Fig. 7, where the $N = 15$ setting is the baseline and its prediction errors are subtracted from those of the compared settings. Combined with the data in Table 7, Fig. 7a shows that for short-term prediction one setting gives the highest average error of 35.13, while the other two settings give 33.84 and 33.65, with little difference between them. For long-term prediction (Fig. 7b), the $N = 15$ setting gives an average prediction error clearly lower than the other groups. Combining the long- and short-term results, we finally choose 15 as the size of $N$.

Table 7.

Ablation of N in fusion module. The best results are highlighted in bold, and the second best is marked by underline.

80 ms 160 ms 320 ms 400 ms 560 ms 720 ms 880 ms 1000 ms Short-average Long-average General average
Inline graphic 10.45 23.67 48.21 58.21 35.13
Inline graphic 10.12 21.43 46.64 57.18 77.03 91.03 101.99 110.37 33.84 95.10 64.47
Inline graphic 9.85 20.86 Inline graphic Inline graphic 75.86 89.85 101.21 109.09 33.65 94.00 63.82
Inline graphic 77.43 91.84 102.85 111.54 95.91
Inline graphic 79.26 93.91 105.57 115.17 98.47
Fig. 7.

(a). Experimental visualization of N ablation in short-term prediction. (b). Experimental visualization of N ablation in long-term prediction.

Conclusion

We propose a three-stage parallel prediction network that sets two intermediate targets on either side of the ground truth to guide the generation of the predictions. The key to the framework's effectiveness is that the parallel network structure reduces error accumulation and allows the prediction to approach the ground truth from both ends based on the two intermediate targets. In addition, we propose a new fusion block based on a GRU and an attention mechanism: the GRU handles short-term fusion, exploiting the advantages of RNNs for short-term prediction, while the attention mechanism weakens the error-accumulation characteristic of RNNs. Finally, we use the high-dimensional information produced by the encoders as the input of the fusion block to further avoid information loss. Extensive experiments and analyses demonstrate the effectiveness and advantages of our method. In the future, we will further explore how to make better use of intermediate targets to obtain a better prediction framework.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771322 and 62206178, the Stable Support Plan for Higher Education Institutions in Shenzhen (Project No. 20231121221536001), and the Fundamental Research Foundation of Shenzhen under Grant JCYJ20220531100814033.

Author contributions

J.Q.Z.: Investigation, Formal analysis, Software, Methodology; C.H.Y: Formal analysis, Software, Methodology; W.M.C: Supervision, Methodology. H.W.: Supervision, Methodology, Funding acquisition.

Data availability

All data generated or analyzed during this study are included in this paper. The data of this work is available on request from the authors. The data of H3.6M and CMU Mocap are also available in the public repository.

Competing interests

We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Jianqi Zhong and Conghui Ye.

References

  • 1. Gao, Z. et al. A pairwise attentive adversarial spatiotemporal network for cross-domain few-shot action recognition-r2. IEEE Trans. Image Process. 30, 767–782 (2020).
  • 2. Ge, S., Zhao, S., Gao, X. & Li, J. Fewer-shots and lower-resolutions: Towards ultrafast face recognition in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, 229–237 (2019).
  • 3. Gui, L.-Y. et al. Teaching robots to predict human motion. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 562–567 (IEEE, 2018).
  • 4. Liu, L. et al. Computing systems for autonomous driving: State of the art and challenges. IEEE Internet Things J. 8, 6469–6486 (2020).
  • 5. Huang, D.-A. & Kitani, K. M. Action-reaction: Forecasting the dynamics of human interaction. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VII 13, 489–504 (Springer, 2014).
  • 6. Liu, Q., Liu, Z., Xiong, B., Xu, W. & Liu, Y. Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function. Adv. Eng. Inform. 49, 101360 (2021).
  • 7. Liu, R. & Liu, C. Human motion prediction using adaptable recurrent neural networks and inverse kinematics. IEEE Control Syst. Lett. 5, 1651–1656 (2020).
  • 8.Guo, X. & Choi, J. Human motion prediction via learning local structure representations and temporal dependencies. In Proceedings of the AAAI Conference on Artificial Intelligence 33, 2580–2587 (2019).
  • 9.Cai, Y. et al. Learning progressive joint propagation for human motion prediction. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, 226–242 (Springer, 2020).
  • 10.Fragkiadaki, K., Levine, S., Felsen, P. & Malik, J. Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision, 4346–4354 (2015).
  • 11.Martinez, J., Black, M. J. & Romero, J. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2891–2900 (2017).
  • 12.Gui, L.-Y., Wang, Y.-X., Liang, X. & Moura, J. M. Adversarial geometry-aware human motion prediction. In Proceedings of the european conference on computer vision (ECCV), 786–803 (2018).
  • 13.Li, C., Zhang, Z., Lee, W. S. & Lee, G. H. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5226–5234 (2018).
  • 14. Arjovsky, M. & Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
  • 15.Aksan, E., Kaufmann, M. & Hilliges, O. Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7144–7153 (2019).
  • 16.Cui, Q. & Sun, H. Towards accurate 3d human motion prediction from incomplete observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4801–4810 (2021).
  • 17.Dang, L., Nie, Y., Long, C., Zhang, Q. & Li, G. Msr-gcn: Multi-scale residual graph convolution networks for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 11467–11476 (2021).
  • 18.Ma, T., Nie, Y., Long, C., Zhang, Q. & Li, G. Progressively generating better initial guesses towards next stages for high-quality human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6437–6446 (2022).
  • 19. Tang, Y., Ma, L., Liu, W. & Zheng, W. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513 (2018).
  • 20.Liu, Z. et al. Towards natural and accurate future motion prediction of humans and animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10004–10012 (2019).
  • 21.Ghosh, P., Yao, Y., Davis, L. & Divakaran, A. Stacked spatio-temporal graph convolutional networks for action segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 576–585 (2020).
  • 22.Yan, S., Xiong, Y. & Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-second AAAI conference on artificial intelligence (2018).
  • 23. Rizkallah, M., Su, X., Maugey, T. & Guillemot, C. Geometry-aware graph transforms for light field compact representation. IEEE Trans. Image Process. 29, 602–616 (2019).
  • 24.You, J., Ying, R., Ren, X., Hamilton, W. & Leskovec, J. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, 5708–5717 (PMLR, 2018).
  • 25.Shi, L., Zhang, Y., Cheng, J. & Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12026–12035 (2019).
  • 26. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • 27.Li, M. et al. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 214–223 (2020).
  • 28. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  • 29.Xiong, C., Merity, S. & Socher, R. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, 2397–2406 (PMLR, 2016).
  • 30.Aksan, E., Kaufmann, M., Cao, P. & Hilliges, O. A spatio-temporal transformer for 3d human motion prediction. In 2021 International Conference on 3D Vision (3DV), 565–574 (IEEE, 2021).
  • 31.Li, M. et al. Skeleton graph scattering networks for 3d skeleton-based human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 854–864 (2021).
  • 32.Zhou, T. et al. Sktformer: A skeleton transformer for long sequence data. In The Eleventh International Conference on Learning Representations (ICLR, 2023).
  • 33.Ibh, M., Grasshof, S., Witzner, D. & Madeleine, P. Tempose: a new skeleton-based transformer model designed for fine-grained motion recognition in badminton. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5199–5208 (2023).
  • 34.Mao, W., Liu, M., Salzmann, M. & Li, H. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF international conference on computer vision, 9489–9497 (2019).
  • 35. Fu, J., Yang, F., Dang, Y., Liu, X. & Yin, J. Learning constrained dynamic correlations in spatiotemporal graphs for motion prediction. IEEE Transactions on Neural Networks and Learning Systems (2023).
  • 36. Ionescu, C., Papava, D., Olaru, V. & Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 1325–1339 (2013).
  • 37.Von Marcard, T., Henschel, R., Black, M. J., Rosenhahn, B. & Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European conference on computer vision (ECCV), 601–617 (2018).


