Scientific Reports. 2025 Nov 17;15:40136. doi: 10.1038/s41598-025-23831-0

Enhanced keypoint recognition framework via multi-scale feature characteristics

Miao Huang 1, Jingli Gao 1, Li Ma 1
PMCID: PMC12623967; PMID: 41249235

Abstract

Keypoint recognition plays a critical role in various computer vision tasks, such as human pose estimation, action recognition, and behavior analysis. Despite significant advances, existing methods often struggle to accurately detect small-scale keypoints in complex environments and to maintain the structural integrity of human poses. We therefore introduce an enhanced keypoint recognition framework that leverages multi-scale feature characteristics. Specifically, it utilizes a Multi-Scale Feature Attention (MSFA) module to fuse features from multiple scales, enabling more effective recognition of small and challenging keypoints. In addition, a structural consistency loss is proposed to ensure the proper alignment of keypoints. Extensive experiments conducted on the MPII Human Pose dataset demonstrate that our approach outperforms existing methods in terms of both accuracy and robustness. The proposed framework advances the state of the art in keypoint recognition, offering a computationally efficient solution for real-world applications that require precise human pose estimation.

Keywords: Keypoint recognition, Multi-scale feature, Deep learning, Structural consistency loss

Subject terms: Computational science, Computer science

Introduction

Keypoint recognition is crucial in computer vision applications, including human pose estimation, action recognition, and behavior analysis1,2. Accurate detection of keypoints enables detailed motion analysis, which is essential for tasks such as video surveillance and sports analytics3. In medical imaging, keypoint recognition plays a key role in diagnosing conditions based on body alignment4. It is also vital for autonomous driving, improving pedestrian detection and overall safety3,4. As a fundamental task in computer vision, keypoint recognition facilitates the interpretation of complex visual data.

Keypoint recognition faces several challenges, particularly in detecting small-scale keypoints in complex environments5,6. Small keypoints, such as facial landmarks and fingertips, are prone to errors due to their size and sensitivity to noise, occlusion, and low resolution7. Maintaining structural consistency across keypoints is another challenge, as misaligned predictions can result in unrealistic body shapes and movements8,9. In real-world applications, variations in pose, lighting, and background clutter further complicate keypoint detection10–12. These challenges highlight the need for advanced models capable of effectively handling both local and global contexts.

Over the years, numerous studies have aimed to improve keypoint recognition, particularly in human pose estimation. Early techniques, such as Active Shape Models (ASMs) and Deformable Part Models (DPMs), were among the first to incorporate geometric constraints for keypoint detection13,14. While these methods showed some success, they struggled in real-world scenarios involving complex poses and occlusions15.

With the rise of deep learning (DL), Convolutional Neural Networks (CNNs) have become the dominant approach for keypoint recognition, owing to their ability to learn hierarchical features from large datasets such as MPII Human Pose16. Tompson et al. introduced a model that combined CNNs with graphical models, enhancing accuracy by refining keypoint predictions through multiple stages17. Similarly, Newell et al. proposed the stacked hourglass network, which employs a symmetric architecture to repeatedly down-sample and up-sample feature maps, refining keypoint detection across different resolutions18. This approach became fundamental for improving accuracy in detecting small keypoints and handling complex poses19. Cao et al. later introduced OpenPose, a bottom-up method that uses Part Affinity Fields (PAFs) to jointly detect keypoints and associate them with specific individuals20. This framework efficiently handles multi-person pose estimation, making it one of the most widely adopted methods in the field. Building on this, Sun et al. developed the High-Resolution Network (HRNet) backbone, which maintains high-resolution feature maps throughout the network and allows for finer keypoint localization21. HRNet demonstrated exceptional performance across a wide range of tasks, including the detection of small-scale keypoints. Hong et al. proposed a geometry-aware approach based on Stacked Capsule Graph Autoencoders (SCGAE) that explicitly encodes the relationships of facial parts and poses via capsule networks and graph regularization22. This method produces a robust 3D head pose representation and has demonstrated improved accuracy on standard head pose datasets by leveraging inherent geometric constraints.

In recent years, Transformer-based models have gained attention for their ability to capture long-range dependencies within images. For instance, Carion et al. introduced DETR (DEtection TRansformer), which leverages attention mechanisms to enhance keypoint detection and localization23. Similarly, Girdhar et al. extended the use of Transformers to human pose estimation, demonstrating promising results in addressing challenges related to occlusion and pose variability24. Xie et al. introduced a High-Order Graph Convolution Transformer (HOGFormer) that combines graph convolutional networks with Transformer architectures to capture both global and local skeletal features25. This high-order model addresses challenges like self-occlusion and depth ambiguity by using a dynamic adjacency matrix representation, and it achieves state-of-the-art accuracy on the Human3.6M 3D pose estimation benchmark. Huang et al. presented a Channel MLP (CM) module to enhance Transformer-based pose estimators by modeling channel-wise information within the network26. Integrating this lightweight channel-attention MLP into a TokenPose architecture improved keypoint detection performance (e.g., achieving 75.2 AP on COCO test-dev) without increasing model complexity.

Despite significant progress, existing methods still face challenges in accurately detecting small keypoints in cluttered environments and maintaining structural consistency27–29. Furthermore, balancing high accuracy with real-time processing remains a significant challenge30. In this paper, we propose an enhanced keypoint recognition framework designed to address these challenges. The proposed approach introduces a Multi-Scale Feature Attention (MSFA) module that captures both local and global features at multiple scales, leading to more accurate keypoint detection in complex scenes. Moreover, a structural consistency loss is incorporated to ensure that the detected keypoints are naturally aligned, thereby reducing errors in human pose estimation. The main contributions of this paper are summarized as follows:

  1. We present an enhanced keypoint recognition framework that combines an HRNet backbone with the MSFA module, offering a computationally efficient solution for real-world applications that require precise human pose estimation.

  2. We propose a multi-scale feature attention (MSFA) module to fuse local and global features, enabling more effective recognition of small and challenging keypoints.

  3. We propose a structural consistency loss to ensure the proper alignment of keypoints, thereby reducing errors in human pose estimation.

Methods

Overview

Figure 1 shows the proposed DL enhanced keypoint recognition framework, which consists of two main components: HRNet, for feature extraction, and the MSFA module, which integrates multi-scale features to enhance keypoint detection. This framework aims to overcome the limitations of traditional networks in detecting small keypoints and preserving the structural integrity of human poses.

Fig. 1. Proposed DL-enhanced keypoint recognition framework, based on HRNet and MSFA.

During training, each input is a two-dimensional (2D) image from the MPII Human Pose dataset, and the output consists of $K$ keypoints representing human joints. Each keypoint is predicted with its class, coordinates, scale, and other relevant parameters. The combination of HRNet’s high-resolution feature preservation and the MSFA module ensures that both large and small keypoints are detected with high accuracy.

Network architecture

HRNet backbone for feature extraction

HRNet serves as the backbone of the proposed framework, providing the essential feature extraction needed for accurate keypoint recognition. It has demonstrated strong performance in various vision tasks, owing to its unique ability to maintain high-resolution representations throughout its architecture. This high-resolution preservation is especially crucial in keypoint recognition, where small features such as fingertips or facial landmarks can easily be lost in down-sampled representations.

HRNet operates by maintaining parallel branches of high- and low-resolution feature maps, exchanging information at each stage. This parallel structure enables the network to capture both detailed local features and broader contextual information simultaneously. Let $I \in \mathbb{R}^{H \times W \times 3}$ represent the input image, where $H$ and $W$ denote its height and width, respectively. HRNet processes the image through multiple stages, preserving and refining high-resolution features throughout the network.

Let $F_i$ denote the feature map at the $i$-th stage of HRNet. Each stage processes the feature map while retaining high-resolution spatial information, allowing for precise keypoint prediction. Keypoint parameters $P$ are produced directly from the HRNet output stage, as expressed in (1).

$$P = \mathrm{softmax}\left(\mathrm{Conv}\left(F_{\mathrm{out}}\right)\right) \tag{1}$$

where $F_{\mathrm{out}}$ is the final feature map from HRNet, and the convolutional layer employs a softmax operation to generate the corresponding parameters. HRNet preserves high-resolution feature maps through a multi-scale fusion process, in which feature maps from different resolutions are combined. This fusion enables the network to retain fine spatial details, which improves small-keypoint localization. The multi-scale fusion process can be represented as follows:

$$F_{\mathrm{fused}} = \sum_{i} w_i F_i \tag{2}$$

where the $F_i$ are the feature maps from the different resolution branches, and the $w_i$ are weights assigned to each resolution, used to balance their contributions to the final fused feature map.
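To make Eq. (2) concrete, the sketch below shows one way the weighted fusion could be implemented. It is a minimal illustration assuming PyTorch and branches that already share a channel width; the 1×1 alignment convolutions, softmax-normalized weights, and bilinear upsampling are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Weighted multi-scale fusion sketch for Eq. (2)."""

    def __init__(self, num_branches: int, channels: int):
        super().__init__()
        # One scalar weight w_i per resolution branch (Eq. 2).
        self.weights = nn.Parameter(torch.ones(num_branches))
        # 1x1 convs to map every branch into a common feature space.
        self.align = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_branches)]
        )

    def forward(self, feats):
        # feats: list of (B, C, H_i, W_i) maps from the resolution branches.
        target_size = feats[0].shape[-2:]        # fuse at the highest resolution
        w = torch.softmax(self.weights, dim=0)   # normalized branch contributions
        fused = 0
        for i, f in enumerate(feats):
            f = self.align[i](f)
            f = F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
            fused = fused + w[i] * f             # weighted sum of Eq. (2)
        return fused
```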

Equation (3) shows the HRNet output, where $x$ and $y$ represent the horizontal and vertical coordinates of the object, respectively, $s$ is the scale, and $c_k$, $x_k$, and $y_k$ denote the class, horizontal coordinate, and vertical coordinate of the $k$-th keypoint. Moreover, $P$ is generated by directly predicting the keypoint parameters from the fused feature map. This process guarantees that HRNet captures both local and global context, leading to high accuracy in predicting both small and large keypoints.

$$P = \left(x,\; y,\; s,\; \left\{(c_k, x_k, y_k)\right\}_{k=1}^{K}\right) \tag{3}$$
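The prediction head of Eqs. (1) and (3) can be read as a convolution over the fused map followed by a softmax. A hedged sketch is given below; the split into a heatmap branch (spatial softmax for the per-keypoint confidence $c_k$) and a regression branch for $(x_k, y_k, s_k)$ is an illustrative assumption, since the paper does not spell out the head’s exact layout.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """One plausible reading of the Eq. (1)/(3) prediction head."""

    def __init__(self, in_channels: int, num_keypoints: int):
        super().__init__()
        self.num_keypoints = num_keypoints
        # One heatmap channel per keypoint for class confidence (softmaxed),
        # plus a small regression head for (x, y, s) per keypoint.
        self.cls_conv = nn.Conv2d(in_channels, num_keypoints, kernel_size=1)
        self.reg_conv = nn.Conv2d(in_channels, num_keypoints * 3, kernel_size=1)

    def forward(self, f_out):                          # f_out: (B, C, H, W)
        b, _, h, w = f_out.shape
        logits = self.cls_conv(f_out).flatten(2)       # (B, K, H*W)
        heatmaps = torch.softmax(logits, dim=-1)       # softmax of Eq. (1)
        reg = self.reg_conv(f_out).mean(dim=(2, 3))    # (B, K*3) pooled regression
        xys = reg.view(b, self.num_keypoints, 3)       # (x_k, y_k, s_k)
        conf = heatmaps.max(dim=-1).values             # c_k per keypoint
        return conf, xys                               # P = {(c_k, x_k, y_k, s_k)}
```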

Multi-scale feature attention (MSFA) module

Once HRNet generates the initial keypoint parameters, the MSFA module refines these parameters by enhancing the feature maps across multiple scales. This module operates by applying convolutional layers with varying receptive fields, enabling the model to better capture keypoint details that exist at different scales.

Figure 2 shows the MSFA module structure. Let $F'$ represent the refined feature map after applying the MSFA module. It is produced using the following equation:

$$F' = \sum_{k \in \mathcal{K}} W_k * F \tag{4}$$

where $k \in \mathcal{K}$ indexes the kernel sizes of the different receptive fields, $W_k$ is the corresponding convolution kernel, and $*$ denotes the convolution operation. This multi-scale processing ensures that the keypoint predictions are sensitive to both small and large details. Moreover, the convolution kernel of each branch is connected to a self-attention module, improving accuracy in difficult cases such as occlusions or noisy environments.
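The following sketch illustrates the MSFA computation of Eq. (4): parallel convolutions with different kernel sizes, each gated by an attention module, summed into the refined map $F'$. The kernel sizes (1, 3, 5, 7) and the squeeze-and-excitation style gate are placeholders for the paper’s unspecified self-attention design, assuming PyTorch.

```python
import torch
import torch.nn as nn

class MSFABranch(nn.Module):
    """One receptive-field branch of Eq. (4): W_k * F, then attention."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        # Squeeze-and-excitation style gate as a simple attention surrogate.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)          # W_k * F with receptive field k
        return y * self.gate(y)   # attention-weighted branch output

class MSFA(nn.Module):
    """Sum over receptive-field branches yields the refined map F'."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [MSFABranch(channels, k) for k in kernel_sizes])

    def forward(self, x):
        return sum(b(x) for b in self.branches)
```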

Fig. 2. Structure of the MSFA module.

The MSFA-refined keypoint predictions $P'$ are then obtained by applying the refined feature maps to the same keypoint prediction layers used for HRNet, as expressed in (5).

$$P' = \mathrm{softmax}\left(\mathrm{Conv}\left(F'\right)\right) \tag{5}$$

Structural consistency loss

One of the key challenges in keypoint recognition is maintaining the structural consistency of human poses, particularly when keypoints are occluded or inaccurately predicted. To address this issue, a Structural Consistency Loss (SC-Loss) is introduced, which ensures that the predicted keypoints form a realistic human pose by imposing spatial and angular constraints between them.

The SC-Loss consists of two components: a distance-based loss and an angle-based loss. These components work together to preserve the relative distances and angles between keypoints, which are crucial for ensuring that the predicted pose remains anatomically plausible.

Distance loss

The distance loss ensures that the predicted distances between keypoints are consistent with the ground-truth distances provided in the annotations. Let $\{p_k\}_{k=1}^{K}$ represent the set of predicted keypoints, and let $d_{ij} = \lVert p_i - p_j \rVert_2$ be the Euclidean distance between two adjacent keypoints $p_i$ and $p_j$. The distance loss is defined in (6), where $\hat{d}_{ij}$ is the ground-truth distance between keypoints $i$ and $j$, and $\mathcal{E}$ is the set of adjacent keypoint pairs. This loss function penalizes deviations between the predicted and true distances, ensuring that the relative positioning of keypoints remains consistent with human anatomy.

$$\mathcal{L}_{\mathrm{dist}} = \sum_{(i,j) \in \mathcal{E}} \left(d_{ij} - \hat{d}_{ij}\right)^2 \tag{6}$$
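A minimal sketch of the distance loss in Eq. (6), assuming PyTorch tensors; the `edges` list of adjacent keypoint pairs (the skeleton’s limbs) is supplied by the caller, and its exact contents for MPII are not specified here.

```python
import torch

def distance_loss(pred, gt, edges):
    """pred, gt: (B, K, 2) keypoint coordinates; edges: list of (i, j) pairs."""
    loss = 0.0
    for i, j in edges:
        d_pred = torch.norm(pred[:, i] - pred[:, j], dim=-1)  # predicted d_ij
        d_gt = torch.norm(gt[:, i] - gt[:, j], dim=-1)        # ground-truth d_ij
        loss = loss + ((d_pred - d_gt) ** 2).mean()           # squared deviation, Eq. (6)
    return loss / len(edges)
```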

Masked angle loss

The masked angle loss enforces angular consistency between keypoint triplets, ensuring that the angles between connected keypoints result in realistic human poses. Let $\theta_t$ denote the joint angle formed by three adjacent keypoints. Instead of penalizing all angular deviations uniformly, a differentiable soft angular mask $M(\theta_t)$ is applied, which down-weights penalties for anatomically plausible angles and up-weights penalties for implausible ones. The masked angle loss is defined as:

$$\mathcal{L}_{\mathrm{angle}} = \sum_{t \in \mathcal{T}} M(\theta_t)\left(\theta_t - \hat{\theta}_t\right)^2 \tag{7}$$

where $\hat{\theta}_t$ denotes the ground-truth angle between these keypoints, $\mathcal{T}$ is the set of keypoint triplets, and $M(\theta_t)$ is a learnable or dataset-derived angular confidence mask. It assigns higher weights to angle deviations falling outside typical human biomechanical limits and reduces penalties within natural motion ranges. This formulation avoids over-penalizing flexible but valid poses while maintaining sensitivity to structurally implausible configurations. MSFA, in turn, aims to improve the recognition of very small keypoints and to enhance the fusion of local features across scales, thereby overcoming HRNet’s limitations in these aspects.

Therefore, the total structural consistency loss is described in (8), where $\lambda$ denotes a weighting factor that balances the importance of distance and angular consistency.

$$\mathcal{L}_{\mathrm{SC}} = \mathcal{L}_{\mathrm{dist}} + \lambda\, \mathcal{L}_{\mathrm{angle}} \tag{8}$$
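Below is a hedged sketch of the masked angle loss (Eq. 7) combined into the total SC-Loss (Eq. 8), reusing `distance_loss` from the previous sketch. The concrete form of the plausibility mask $M(\theta)$ is our placeholder; the paper describes it only as learnable or dataset-derived, so the ReLU-shaped weighting below is purely illustrative.

```python
import torch

def joint_angle(p, a, b, c):
    """Angle at keypoint b formed by triplet (a, b, c); p: (B, K, 2)."""
    v1 = p[:, a] - p[:, b]
    v2 = p[:, c] - p[:, b]
    cos = (v1 * v2).sum(-1) / (v1.norm(dim=-1) * v2.norm(dim=-1) + 1e-8)
    return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))

def sc_loss(pred, gt, triplets, edges, lam=0.5):
    """Total structural consistency loss, Eq. (8); lam is the lambda weight."""
    l_angle = 0.0
    for a, b, c in triplets:
        theta = joint_angle(pred, a, b, c)
        theta_gt = joint_angle(gt, a, b, c)
        # Placeholder soft mask M(theta): penalties grow as the predicted
        # angle leaves an assumed "typical" range around 90 degrees.
        mask = 1.0 + torch.relu(torch.abs(theta.detach() - 1.57) - 1.0)
        l_angle = l_angle + (mask * (theta - theta_gt) ** 2).mean()
    return distance_loss(pred, gt, edges) + lam * l_angle / len(triplets)
```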

Results and discussion

Data preparation

In this study, the MPII Human Pose dataset is used to train and evaluate the proposed framework31. The dataset consists of a diverse range of images depicting human poses in various real-world scenarios, with over 40,000 annotated poses across approximately 25,000 images. Each image is annotated with $K$ keypoints representing major human joints, such as the shoulders, elbows, knees, and hips. For this study, the dataset is divided into a training set, a validation set, and a test set; most of the data is used for training, with smaller portions reserved for validation and testing, ensuring a balanced distribution of poses and activities across all sets. This split enables effective model training while reserving a portion of the data to fine-tune hyper-parameters during validation and to assess final performance during testing.

To further ensure robustness, the dataset split is stratified to maintain an even distribution of challenging examples, such as images with occlusions or complex poses, across the training and test sets. This approach helps prevent bias in model evaluation and ensures the model is exposed to a wide variety of pose configurations during training. Moreover, all images are resized to a fixed resolution for consistency during training and to meet the input requirements of the architecture. Normalization is also applied to ensure that the pixel values fall within the same range, improving the stability and efficiency of model training.
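As one concrete reading of this preprocessing step, the snippet below resizes and normalizes inputs with torchvision; the 256×256 resolution and ImageNet statistics are common defaults for HRNet-style pose models, not values confirmed by the paper.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),        # fixed resolution for the network input
    transforms.ToTensor(),                # pixel values scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```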

Implementation

Experiments were implemented in Python 3.8 and TensorFlow 1.8 on a workstation featuring an Intel(R) Core(TM) i9-10900K CPU and an NVIDIA RTX 3090 GPU (24 GB).

The Adam algorithm32 is used to optimize the loss function. The learning rate is initialized to a fixed value and gradually reduced over 100 epochs with a cosine annealing schedule. The Mean Squared Error (MSE) is employed to evaluate the performance of the trained network on the test dataset.
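The training setup described above can be sketched as follows. The snippet is written in PyTorch for brevity (the paper reports a TensorFlow implementation), the 1e-3 initial learning rate is a placeholder since the exact value is not recoverable here, and `model`, `train_loader`, and `criterion` are assumed to be supplied by the caller.

```python
import torch

def train(model, train_loader, criterion, epochs=100):
    # Adam with an assumed 1e-3 initial learning rate (exact value not given).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Cosine annealing over the full 100-epoch schedule, as described above.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)  # MSE plus SC-Loss terms
            loss.backward()
            optimizer.step()
        scheduler.step()                              # one cosine step per epoch
```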

Comparison

To evaluate the performance of the proposed framework, it is compared to several state-of-the-art keypoint recognition methods, including HRNet8, OpenPose20, StackedHourglass18, and DeepPose16. Each of these models is trained and tested on the same datasets to ensure a fair comparison. To clearly delineate the individual contributions and interaction effects of each proposed component, comprehensive ablation studies were conducted. Specifically, we evaluated the baseline HRNet model, HRNet with MSFA alone, HRNet with the Structural Consistency Loss alone, and the final integrated model combining both MSFA and the Structural Consistency Loss. As shown in (9), keypoint prediction accuracy is measured using the Percentage of Correct Keypoints (PCKh) metric, which considers a keypoint prediction correct if it lies within a specified threshold distance from the ground truth, normalized by the head segment length:

$$\mathrm{PCKh} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left(\left\lVert p_k - \hat{p}_k \right\rVert_2 < \alpha \cdot \ell_{\mathrm{head}}\right) \tag{9}$$

A keypoint is considered correct if the Euclidean distance between the predicted and ground-truth keypoints is less than a fraction $\alpha$ of the head segment length $\ell_{\mathrm{head}}$ (with $\alpha = 0.5$ in the standard PCKh@0.5 setting). This metric is commonly used in human pose estimation tasks. To comprehensively evaluate the performance of the proposed framework, the Mean Per Joint Position Error (MPJPE)33 and the F1-score34 are also reported.
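For reference, a compact implementation of the PCKh metric of Eq. (9) might look like the following sketch, assuming PyTorch tensors; `head_len` holds the per-image head segment lengths taken from the annotations.

```python
import torch

def pckh(pred, gt, head_len, alpha=0.5):
    """pred, gt: (B, K, 2); head_len: (B,) head segment lengths per image."""
    dist = torch.norm(pred - gt, dim=-1)              # (B, K) Euclidean errors
    correct = dist < alpha * head_len.unsqueeze(-1)   # per-image threshold test
    return correct.float().mean().item()              # fraction of correct keypoints
```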

Experiments

Accuracy evaluation

Figure 3 shows the heatmaps of the prediction results on the test sets. The proposed method demonstrates strong performance in detecting both large and small keypoints, even in challenging poses and occluded regions. The heatmaps are concentrated around the correct keypoint locations, showcasing precise predictions.

Fig. 3. Predicted heatmaps of keypoints on the test sets.

To provide a more detailed evaluation, the PCKh metric is analyzed for different categories of keypoints, including large-scale keypoints (e.g., shoulders, hips) and small-scale keypoints (e.g., wrists, ankles). Table 1 presents the quantitative results of the proposed method compared to various state-of-the-art keypoint recognition algorithms, with all models evaluated on the same dataset. As shown in Table 1, the proposed framework achieves the highest accuracy among all compared methods. The improvement over the baseline HRNet is mainly attributed to the addition of the MSFA module, which significantly improves the detection of small keypoints by effectively integrating multi-scale features. In contrast, traditional methods like OpenPose and StackedHourglass, while efficient in some scenarios, struggle to maintain similar accuracy in more complex environments. Even compared to more recent transformer-based approaches such as TokenPose and PoseFormer, our method continues to outperform them, achieving superior PCKh especially for small-scale keypoints. Notably, our approach also attains the best overall PCKh and the highest F1-score, indicating its superior performance in both keypoint localization accuracy and maintaining structural consistency in the predicted poses.

Table 1.

Quantitative results of keypoint recognition.

Method | PCKh | Large-scale | Small-scale | MPJPE | F1-Score
HRNet | 95.4% | 95.5% | 95.1% | 5.7 | 94.2%
OpenPose | 93.1% | 93.2% | 93.0% | 6.2 | 93.0%
StackedHourglass | 91.8% | 91.9% | 91.7% | 7.1 | 91.0%
DeepPose | 91.2% | 91.4% | 91.0% | 7.1 | 91.0%
TokenPose | 96.0% | 96.2% | 95.8% | 5.3 | 95.4%
PoseFormer | 96.3% | 96.4% | 96.0% | 5.1 | 95.7%
Our Method | 97.0% | 97.3% | 96.8% | 4.9 | 96.3%

In addition to the overall comparisons, an ablation study is conducted to examine the contributions of the MSFA module’s design and the SC-Loss sub-components, with results presented in Table 2. Specifically, a variant of MSFA without the attention mechanism and another using only single-scale feature fusion are evaluated, along with SC-Loss variations that apply only the distance constraint or only the angle constraint. As shown in Table 2, each of these simplified variants leads to a drop in performance compared to the full model, highlighting the importance of both components for optimal accuracy. For instance, removing the attention mechanism or restricting MSFA to a single scale noticeably reduces the accuracy, particularly for small keypoints, which demonstrates that attention-driven multi-scale feature integration is crucial for capturing fine details. Similarly, using only the distance term or only the angle term in the SC-Loss yields inferior accuracy and F1-scores - the distance-only loss cannot fully enforce correct limb proportions, while the angle-only loss is insufficient to maintain realistic joint configurations. Ultimately, the full model with MSFA (multi-scale with attention) and the complete SC-Loss achieves the highest accuracy, confirming that each component complements the other and that their combination provides the most significant improvement in keypoint detection performance.

Table 2.

Detailed ablation results on MSFA and SC-Loss.

Method | PCKh | Large-scale | Small-scale | MPJPE | F1-Score
HRNet (baseline) | 95.4% | 95.5% | 95.1% | 5.7 | 94.2%
HRNet + MSFA (no attention) | 96.0% | 96.1% | 95.8% | 5.4 | 95.0%
HRNet + MSFA (single scale only) | 95.6% | 95.7% | 95.4% | 5.6 | 94.6%
HRNet + SC-Loss (distance only) | 95.7% | 95.9% | 95.5% | 5.5 | 94.4%
HRNet + SC-Loss (angle only) | 95.5% | 95.6% | 95.3% | 5.6 | 94.6%
Our method | 97.0% | 97.3% | 96.8% | 4.9 | 96.3%

Qualitative analysis of keypoint detection

The qualitative analysis of keypoint detection highlights the robustness of the proposed method, especially in challenging situations where other models tend to fail. Figure 4 illustrates representative examples of keypoint recognition by the proposed method.

Fig. 4. Examples of keypoint recognition by the proposed method.

Occluded keypoints: a key challenge in human pose estimation is predicting keypoints that are partially or fully occluded by other objects or body parts. The MSFA module enables our framework to better capture global context and small-scale details, thereby improving the detection of occluded keypoints. Images highlighted by the green rectangle show examples of occluded keypoints. Even when a person’s body is partially occluded by another object, the proposed method can still accurately predict the locations of the keypoints.

Closely spaced keypoints: some body parts are difficult to detect because they lie very close to each other in the image. Merging multi-scale features through the MSFA module enhances the network’s ability to identify these small keypoints by processing features at different resolutions. Images highlighted by the blue rectangle show examples of such cases. Even when body parts are very close to each other, the proposed method can still accurately predict the locations of the keypoints.

Extreme poses: in cases where the human body assumes extreme or unusual poses (e.g., during sports activities), maintaining structural consistency between keypoints is crucial. The Structural Consistency Loss introduced in the proposed framework helps preserve the anatomical relationships between keypoints, leading to more realistic pose predictions, even in complex poses. Images highlighted by the yellow rectangle provide examples of extreme poses where the proposed method yields accurate predictions.

The robustness of the proposed method is further evaluated in complex scenarios, including occluded keypoints, close proximity of body parts, and extreme poses. Table 3 summarizes the accuracy of keypoint predictions under these specific conditions. Note that PCKh is exclusively used here, as it directly reflects spatial accuracy, making it most suitable for intuitive qualitative visualization. The results demonstrate that the proposed method achieves superior performance compared to baseline models, particularly in occlusion scenarios, where the MSFA module captures the global context effectively.

Table 3.

PCKh in complex scenarios.

Method | PCKh | Large-scale | Small-scale | MPJPE | F1-Score
HRNet | 95.4% | 95.5% | 95.1% | 5.7 | 94.2%
OpenPose | 93.1% | 93.2% | 93.0% | 6.2 | 93.0%
StackedHourglass | 91.8% | 91.9% | 91.7% | 7.1 | 91.0%
DeepPose | 91.2% | 91.4% | 91.0% | 7.1 | 91.0%
TokenPose | 96.0% | 96.2% | 95.8% | 5.3 | 95.4%
PoseFormer | 96.3% | 96.4% | 96.0% | 5.1 | 95.7%
Our Method | 97.0% | 97.3% | 96.8% | 4.9 | 96.3%

Computational efficiency and discussion

In addition to accuracy, computational efficiency is a key factor when evaluating keypoint detection models, especially for real-time applications such as video analysis. Table 4 reports both FLOPs and FPS of all compared methods under the same hardware configuration.

Table 4.

Runtime performance comparison.

Method | FLOPs (G) | FPS
HRNet | 20.3 | 14
OpenPose | 11.0 | 8
StackedHourglass | 14.7 | 10
DeepPose | 9.0 | 6
TokenPose | 13.2 | 10
PoseFormer | 12.5 | 9
Our Method | 16.0 | 12

Although our method introduces slightly more FLOPs than HRNet due to the MSFA module, it still runs at a competitive speed (12 FPS) while achieving the highest accuracy. More importantly, it clearly outperforms OpenPose, StackedHourglass, and DeepPose in both accuracy and efficiency, and also maintains a better balance between FLOPs and FPS than recent transformer-based approaches such as TokenPose and PoseFormer. These results confirm that our design provides a favorable trade-off between precision and efficiency, making it suitable for practical scenarios where both accuracy and speed are required.

While high-performance models often incorporate advanced architectures to achieve state-of-the-art results, these improvements frequently come at the expense of increased computational complexity, making them less suitable for scenarios requiring real-time processing or deployment on resource-constrained devices. In contrast, the proposed framework adopts a streamlined approach by combining the high-resolution feature extraction capabilities of HRNet with the multi-scale feature integration of the MSFA module. This design ensures robust keypoint detection, particularly for small-scale and occluded keypoints, while maintaining computational efficiency. The introduction of the Structural Consistency Loss further enhances the framework’s ability to maintain anatomically plausible predictions, a critical aspect for applications such as pose estimation in cluttered environments or complex scenes.

One of the primary advantages of this framework lies in its adaptability to varying computational constraints. By focusing on lightweight yet effective modules, the framework avoids the trade-off between accuracy and efficiency that characterizes many complex architectures. This balance positions the proposed method as a practical solution for diverse use cases, from video surveillance to human-computer interaction, where both precision and speed are critical.

Moreover, the framework emphasizes interpretability and modularity, enabling seamless integration into existing pipelines. The reliance on multi-scale features and structured loss functions underscores its ability to generalize across diverse datasets and scenarios, without the need for excessive computational resources. These attributes highlight the practicality and scalability of the proposed method, making it a versatile tool in the field of keypoint recognition. The framework excels in challenging scenarios such as detecting small or partially occluded keypoints while preserving pose structure. However, it still faces difficulties under severe occlusions (for instance, completely overlapping limbs) or highly ambiguous poses that fall outside typical movement patterns. We emphasize that these extreme cases remain challenging and are beyond the scope of our current approach, although they will be addressed in future work.

Conclusion

This paper presents an enhanced keypoint recognition framework that combines MSFA with HRNet to address key challenges in human pose estimation. By preserving high-resolution feature maps and refining multi-scale representations, the proposed method demonstrates superior performance in detecting small and occluded keypoints, areas where conventional models often struggle. Furthermore, the integration of a structural consistency loss ensures that the predicted keypoints maintain anatomically plausible relationships, improving the overall structural integrity of the pose.

The effectiveness of the proposed framework was validated through extensive experiments on the MPII Human Pose dataset, where it achieved a PCKh of 97.0%, surpassing the performance of well-established methods. The framework excels particularly in challenging scenarios, such as detecting small and occluded keypoints, while consistently preserving the structural coherence of complex poses. However, certain limitations remain, especially in cases involving overlapping limbs or highly ambiguous poses. Despite these challenges, the overall performance highlights the robustness, accuracy, and computational efficiency of the framework, making it well-suited for real-world applications that require precise human pose estimation.

As for the future work, further refinement of the framework could involve incorporating a dynamic attention mechanism to adaptively emphasize regions with high pose ambiguity. Additionally, exploring lightweight model adaptations may enhance real-time performance, broadening the framework’s applicability in time-sensitive scenarios such as video surveillance and interactive applications.

Acknowledgements

This work was supported by the Henan Provincial Science and Technology Research Project under Grants 232102210019 and 252102320298, and the Doctoral Startup Foundation of Pingdingshan University under Grant PXY-BSQD-2018028.

Author contributions

Miao Huang: Writing – original draft, Methodology, Validation, Conceptualization. Jingli Gao: Writing – review & editing, Conceptualization, Supervision. Li Ma: Writing – review & editing, Visualization.

Data availability

The code used for data analysis in this work can be obtained from the corresponding author, Jingli Gao (gjl@pdsu.edu.cn), upon reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Moeslund, T. B., Hilton, A. & Krüger, V. A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104, 90–126 (2006).
2. Poppe, R. Vision-based human motion analysis: an overview. Comput. Vis. Image Underst. 108, 4–18 (2007).
3. Felzenszwalb, P. F., Girshick, R. B., McAllester, D. & Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645 (2009).
4. Ioffe, S. et al. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
5. Bulat, A. & Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII, 717–732 (Springer, 2016).
6. Chen, X. et al. MobRecon: mobile-friendly hand mesh reconstruction from monocular image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20544–20554 (2022).
7. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
8. Zhang, H. et al. HF-HRNet: a simple hardware friendly high-resolution network. IEEE Transactions on Circuits and Systems for Video Technology (2024).
9. Li, W. et al. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148 (2019).
10. Reddy, N. D., Vo, M. & Narasimhan, S. G. Occlusion-Net: 2D/3D occluded keypoint localization using graph networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7326–7335 (2019).
11. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
12. Zhang, Z., Luo, P., Loy, C. C. & Tang, X. Facial landmark detection by deep multi-task learning. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI, 94–108 (Springer, 2014).
13. Cootes, T. F., Edwards, G. J. & Taylor, C. J. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23, 681–685 (2001).
14. Felzenszwalb, P. F., Girshick, R. B., McAllester, D. & Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645 (2009).
15. Delingette, H. General object reconstruction based on simplex meshes. Int. J. Comput. Vision 32, 111–146 (1999).
16. Toshev, A. & Szegedy, C. DeepPose: human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1653–1660 (2014).
17. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. Adv. Neural Inf. Process. Syst. 27 (2014).
18. Newell, A., Yang, K. & Deng, J. Stacked hourglass networks for human pose estimation. In Computer Vision – 14th European Conference, ECCV 2016, Proceedings (Springer, 2016).
19. Zou, X., Bi, X. & Yu, C. Improving human pose estimation based on stacked hourglass network. Neural Process. Lett. 55, 9521–9544 (2023).
20. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7291–7299 (2017).
21. Sun, K., Xiao, B., Liu, D. & Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5693–5703 (2019).
22. Hong, C., Chen, L., Liang, Y. & Zeng, Z. Stacked capsule graph autoencoders for geometry-aware 3D head pose estimation. Comput. Vis. Image Underst. 103224 (2021).
23. Carion, N. et al. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229 (Springer, 2020).
24. Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M. & Tran, D. Detect-and-track: efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 350–359 (2018).
25. Xie, Y., Hong, C., Zhuang, W., Liu, L. & Li, J. HOGFormer: high-order graph convolution transformer for 3D human pose estimation. Int. J. Mach. Learn. Cybern. 16, 599–610 (2025).
26. Huang, J., Hong, C., Xie, R., Ran, L. & Qian, J. A simple and efficient channel MLP on token for human pose estimation. Int. J. Mach. Learn. Cybern. (2025).
27. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2980–2988 (2017).
28. Wang, C.-Y. et al. CSPNet: a new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 390–391 (2020).
29. Chen, X. et al. MobRecon: mobile-friendly hand mesh reconstruction from monocular image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20544–20554 (2022).
30. Guan, Q., Sheng, Z. & Xue, S. HRPose: real-time high-resolution 6D pose estimation network using knowledge distillation. Chin. J. Electron. 32, 189–198 (2023).
31. Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3686–3693 (2014).
32. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
33. Wang, J. et al. Deep 3D human pose estimation: a review. Comput. Vis. Image Underst. 210, 103225 (2021).
34. Li, C., Zhang, B., Shi, J. & Cheng, G. Multi-level domain adaptation for lane detection. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4379–4388 (2022).


