Sensors. 2021 Sep 7;21(18):5996. doi: 10.3390/s21185996

Table 4.

Summary table of the technical implementation aspects of HPE in the filtered papers. * More details are provided in the following paragraphs.

Paper | Base Architecture/System | Methodology
[20] OpenPose * [21]. The RGB image and depth data are obtained using a Kinect. Using OpenPose, the 2D pose is predicted and then mapped to the acquired depth data to generate the 3D pose.
Then, the 3D pose is used to estimate gait parameters, as explained in Table A4.
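The 2D-pose-plus-depth mapping described for [20] can be sketched as a standard pinhole back-projection. This is a minimal illustration, not the paper's implementation; the intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical values merely in the range of a Kinect RGB camera:

```python
import numpy as np

def backproject(keypoints_2d, depth_m, fx, fy, cx, cy):
    """Lift 2D keypoints (pixels) to 3D camera coordinates using the
    depth value sampled at each keypoint (pinhole camera model).

    keypoints_2d: (N, 2) array of (u, v) pixel coordinates
    depth_m:      (N,) array of depth values in meters at each keypoint
    """
    u, v = keypoints_2d[:, 0], keypoints_2d[:, 1]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=1)  # (N, 3) points in meters

# Hypothetical intrinsics; a real system would use the Kinect calibration
pts3d = backproject(np.array([[320.0, 240.0]]), np.array([2.0]),
                    fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```

A keypoint at the principal point maps straight down the optical axis, so only its depth survives as the Z coordinate.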
[22] Hierarchical poselets, based on the concept of ‘poselet’ introduced in [23]. A poselet represents a specific configuration and appearance of a body part; this work uses 20 body parts. For each poselet, Histogram of Oriented Gradients (HOG) features are constructed, and a linear SVM classifier is used to detect the presence of that poselet.
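The HOG-plus-linear-SVM detector of [22] can be sketched in simplified form. This is only an assumption-laden illustration: a real HOG descriptor uses cells, blocks, and block normalization, and the weights `w`, `b` would come from training a linear SVM on labeled poselet patches; here they are random placeholders.

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Minimal HOG-style descriptor: a single orientation histogram over
    the patch, weighted by gradient magnitude (real HOG adds cells,
    blocks, and normalization)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned gradient orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-9)

def poselet_score(patch, w, b):
    """Linear-SVM-style decision value for one poselet detector."""
    return float(orientation_histogram(patch) @ w + b)

# Hypothetical weights; in practice (w, b) are learned from positive and
# negative examples of the poselet.
rng = np.random.default_rng(0)
patch = rng.random((32, 32))
w, b = rng.standard_normal(9), -0.1
score = poselet_score(patch, w, b)
```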
[12] The framework can incorporate any part detector. In the example, spatio-temporally linked Pictorial Structures are used to estimate the human pose. The authors implement an algorithm for non-sequential propagation of keyframes to other similar frames using a Minimum Spanning Tree (MST), reducing the amount of manual interaction and the number of pose estimations.
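The MST-based non-sequential propagation in [12] can be sketched as follows: treat frames as nodes, visual dissimilarity as edge weights, build an MST, and propagate annotations from the keyframe along tree edges so each frame inherits the estimate of its most similar neighbor. This is a generic Prim's-algorithm sketch, not the paper's code; the distance matrix here stands in for whatever frame-similarity measure the system uses.

```python
import numpy as np

def mst_edges(dist):
    """Prim's algorithm: edges of the minimum spanning tree of a dense,
    symmetric distance matrix (frames as nodes, dissimilarity as weight).
    Edges are returned in the order the tree grows from node 0, which is
    also a valid propagation order from a keyframe at node 0."""
    n = dist.shape[0]
    in_tree = np.zeros(n, bool)
    in_tree[0] = True
    best = dist[0].copy()        # cheapest known connection to the tree
    parent = np.zeros(n, int)
    edges = []
    for _ in range(n - 1):
        j = np.argmin(np.where(in_tree, np.inf, best))
        edges.append((parent[j], j))
        in_tree[j] = True
        closer = dist[j] < best  # j may now be a cheaper attachment point
        best[closer] = dist[j][closer]
        parent[closer] = j
    return edges

# Frames laid out on a line: 0, 1, 2 are similar; 3 is far from all
pos = np.array([0.0, 1.0, 2.0, 10.0])
D = np.abs(pos[:, None] - pos[None, :])
edges = mst_edges(D)
```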
[24] 10-layer hourglass network cascade model. To address the self-occlusions of athletes in the air, the authors use the mutual relations between the keypoints in the heatmap generated by each level of the network to iteratively refine the occluded keypoints and improve the prediction accuracy of all keypoints.
[25] 3-part CNN architecture. The first part consists of the first twelve layers of VGG-19 [26]. The second part takes the set of features generated by the first part and estimates a heatmap (the “hot spot map”) and its loss. The third part is divided into six stages, each of which uses the heatmap and loss of the previous stage, together with the input features, to estimate refined heatmaps and losses until the final result.
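Several of the architectures above ([24,25]) predict per-joint heatmaps rather than coordinates. Decoding a heatmap into a keypoint can be sketched minimally as an argmax over each map; this is the standard trick, assuming one heatmap per joint, not a description of any specific paper's decoder:

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Turn per-joint heatmaps of shape (J, H, W) into (J, 2) keypoint
    coordinates (x, y) by taking the argmax of each map."""
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1).argmax(axis=1)  # flat index of each peak
    ys, xs = np.divmod(flat, W)                    # row, column of the peak
    return np.stack([xs, ys], axis=1)

# Toy heatmap: joint 0 peaks at pixel (x=5, y=3)
hm = np.zeros((1, 8, 8))
hm[0, 3, 5] = 1.0
kp = decode_heatmaps(hm)
```

Sub-pixel refinements (e.g., a quarter-pixel offset toward the second-highest neighbor) are common in practice but omitted here.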
[27] ResNet-50. First, a binary human detection module is used to detect the person, similar to R-CNN-family models [28,29].
The CNN model ResNet-50 is used to extract features from each frame of a video. Sports videos usually suffer from blur due to the fast movement of athletes; to address this and, at the same time, improve the performance of the system, the authors created a structural-aware Spatial-Temporal relation convolution module. This module analyzes the spatial relation of the different keypoints in each frame, as well as the temporal relation of each keypoint across frames. These features are concatenated to obtain the keypoints of the analyzed person.
[30] Processing of depth data. The authors use a Kinect camera to obtain the depth image of a person. An initial human-extraction process is then applied: floor removal, followed by a 3D connected-component labeling technique [31] to segment the objects in the original depth image, identifying the human objects among the segmented ones under the assumption that only humans move. Then, ridge data is generated using a distance transform map, as in [32]. Finally, the estimation is performed, starting from a calibration position of the body and applying a hierarchical top-down HPE method, which makes the approach invariant to rotation and occlusion, both very frequent in dancing.
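The connected-component labeling step used by [30] can be sketched in its basic 2D, 4-connected form (the paper uses a 3D variant on depth data). This is a generic BFS illustration, not the cited algorithm:

```python
import numpy as np
from collections import deque

def label_components(mask):
    """4-connected component labeling of a binary mask via BFS.
    Returns an integer label image (0 = background)."""
    labels = np.zeros(mask.shape, int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                      # already part of a component
        current += 1
        labels[seed] = current
        q = deque([seed])
        while q:
            y, x = q.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    q.append((ny, nx))
    return labels

# Two separate foreground blobs
mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1]])
lab = label_components(mask)
```

After labeling, the moving component would be kept as the human candidate, per the paper's assumption that only humans move.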
[13] The architecture is based on [21]. It takes advantage of part affinity fields (PAFs) to preserve both location and orientation information across the region of support of each limb, which improves the estimation.
[17] OpenPose. The authors use an approach based on occupancy maps to associate person detections between viewpoints [33]. To reconstruct the person in 3D, each joint detection is back-projected using the calibration of the corresponding camera to produce a ray in space, and the “intersection” of the 3D rays is found as a least-squares solution. In this way, the authors obtain an accuracy similar to that of marker-based systems.
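The least-squares ray "intersection" in [17] has a well-known closed form: the point minimizing the sum of squared perpendicular distances to all rays. The sketch below assumes hypothetical camera centers and known ray directions, standing in for the back-projected joint detections:

```python
import numpy as np

def intersect_rays(origins, directions):
    """Least-squares 'intersection' of 3D rays p_i + t * d_i: the point x
    minimizing sum_i ||(I - d_i d_i^T)(x - p_i)||^2, i.e., the sum of
    squared perpendicular distances to the rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)   # projector perpendicular to the ray
        A += M
        b += M @ p
    return np.linalg.solve(A, b)         # singular only if all rays are parallel

# Two hypothetical camera centers whose joint rays meet at (1, 1, 5)
origins = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
target = np.array([1.0, 1.0, 5.0])
x = intersect_rays(origins, target - origins)
```

With noisy detections the rays do not meet exactly, and the same formula returns the point closest to all of them.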
[14] VGG11. A feature fusion network is constructed using a pointwise feature, a global feature, and an RGB feature. A C3D CNN model is used as the feature extractor.
[15] Convolutional Pose Machine (CPM) [34]. The HPE method is implemented as-is, so that other parameters related to the running form can be estimated, such as speed, step frequency, and swing angles.
[35] Stacked hourglass network proposed by [36]. The HyperStackNet architecture is divided into three parts: the original stacked hourglass network, which produces the initial heatmap of 16 joint positions; the latent pose vector, which concatenates the outputs of the 8 hourglass modules of the original network; and the modified stacked hourglass network, which uses the information provided by the previous part both to improve the prediction and to add two more keypoints for the hockey stick.
[37] CPM. One fine-tuned CPM for each of the four main swimming styles (freestyle, backstroke, butterfly, and breaststroke). CPMs perform very well in general-purpose contexts, but visually challenging footage of swimmers may still confuse HPE systems, due to heavy splashes, water bubbles, or refractions, producing many false estimates and problems such as complete swaps of the left and right body sides and single-joint outliers. The authors therefore implement three methods to improve performance in this context: an optimization for untangling joint swaps, a novel robust regression method to filter coordinate outliers and signal noise, and data-dependent filters for fine-tuning the joint coordinates.
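The single-joint outliers mentioned for [37] are the kind of artifact a temporal filter can suppress. The paper's robust regression is more elaborate; as a minimal stand-in, a sliding median over one joint coordinate's trajectory removes isolated spikes while preserving smooth motion:

```python
import numpy as np

def median_filter_1d(signal, k=5):
    """Sliding-window median over a 1D joint-coordinate trajectory.
    Isolated outliers are replaced by the local median; edges are padded
    by repeating the boundary values."""
    pad = k // 2
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return np.median(windows, axis=1)

# Frame 3 carries an obvious single-joint outlier
traj = np.array([0.0, 1.0, 2.0, 50.0, 4.0, 5.0, 6.0])
smoothed = median_filter_1d(traj)
```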
[18] OpenPose. The authors obtain the 3D position of each joint detected by OpenPose by applying the direct linear transform (DLT) to triangulate each 2D keypoint.
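The DLT triangulation used by [18] admits a compact sketch: stack two linear constraints per view from the projection equations and take the null vector via SVD. The two projection matrices below are hypothetical toy cameras, not the paper's calibration:

```python
import numpy as np

def triangulate_dlt(P_list, uv_list):
    """Direct linear transform: triangulate one 3D point from its 2D
    projections (u_i, v_i) under 3x4 projection matrices P_i. The
    homogeneous solution is the right-singular vector of the stacked
    constraint matrix with the smallest singular value."""
    rows = []
    for P, (u, v) in zip(P_list, uv_list):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]
    return X[:3] / X[3]              # de-homogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras: identity, and a unit baseline along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_hat = triangulate_dlt([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

Repeating this per joint over all camera pairs yields the full 3D pose.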
[38] Segmentation of the participants’ silhouettes. Image thresholding was used for segmentation; it was applied to the blue color channel of the frame due to the significant contrast between the participant’s body and the environment, so this method can only be applied in constrained contexts like this use case. The model was obtained from a swimming frame that contained a complete body segment, and the joint positions could be determined by looking for the centroid of the intersection between two body parts. The proposed system was limited to swimmers with a symmetrical butterfly stroke movement, as the left and right body parts are not separated.
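The blue-channel thresholding of [38] is simple enough to sketch end to end. The threshold value and the toy "pool" frame below are assumptions for illustration; the paper derives its threshold from the actual footage:

```python
import numpy as np

def segment_blue_channel(frame, thresh=100):
    """Binary silhouette from the blue channel of an RGB frame, assuming
    the body is darker than a bright blue pool background."""
    return frame[:, :, 2] < thresh   # channel 2 = blue in RGB ordering

def centroid(mask):
    """Centroid (x, y) of the foreground pixels, as used to locate the
    intersection region between two body parts."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

# Toy frame: bright blue background with a dark 2x2 'body' patch
frame = np.full((6, 6, 3), (0, 0, 200), dtype=np.uint8)
frame[2:4, 3:5] = (80, 40, 30)
mask = segment_blue_channel(frame)
cx, cy = centroid(mask)
```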
[39] ResNet-101 (global network) and Region-based Fully Convolutional Network (R-FCN) (local network). The global network, a large deep network, estimates the locations of parts using global features, which are fed into the small local network, in which position-sensitive ROI pooling based on R-FCN [40] is applied to refine the predictions using local information.
[41] Segmentation of the participants’ silhouettes. First, a salient region detection method is used to detect the visually noticeable regions in the image; then, a method for foreground segmentation by skin tone detection is applied. Through these two steps, the silhouette of the person is obtained. Then, five basic body keypoints are detected using the body parts model, and seven more body keypoints are derived from the previously detected ones.
[42] Mask R-CNN [29]. Other HPE methods such as CPM were used previously, but even though their performance was acceptable, the error was higher due to outliers, and ski detection was a major problem. The authors developed a new model based on Mask R-CNN that uses a branch to detect keypoints instead of generating segmentation masks. It can even learn non-body keypoints, such as ski tips and ski tails, which is very useful in sports, where the detection of sports equipment is often valuable or even necessary, depending on the objective of the system.
[43] Cascaded Pyramid Networks (CPN) [44]. First, a synthetic dataset is rendered and converted into a more realistic synthetic dataset using CycleGAN [45]. Then, the initial synthetic data, combined with the CycleGAN-translated data and mixed with COCO, is used to train the CPN.
[46] VNect [47]. VNect, which is based on ResNet-50 [48], is used for 2D pose estimation. Then, a residual linear network, based on [49], is used to lift the 2D joint positions to 3D.