Abstract
We developed a method to quantify hand postural tremor amplitude and frequency from standard videos without external calibration. Using iris-based scaling and deep-learning models for 3D pose estimation and hand tracking, the automatic approach demonstrated excellent agreement with manual measures. Clinical validation in persons with Parkinson’s disease demonstrated sensitivity to treatment effects and strong correlation with clinician-provided scores. The proposed method, available as part of VisionMD, enables objective, scalable assessment of postural tremor from standard videos.
Subject terms: Computational biology and bioinformatics, Diseases, Health care, Medical research
Introduction
Tremor is a prevalent motor symptom across multiple neurological movement disorders, including Parkinson’s disease, essential tremor, and dystonia, which collectively affect over 7.5 million adults in the United States alone1–3. Tremor presence and severity are key indicators that support diagnosis and evaluation of disease progression and treatment efficacy. Traditionally, clinical evaluation of tremor relies on visual inspection and standardized rating scales4, which are inherently subjective and are performed infrequently because few clinicians are qualified to administer them5. Given the importance of accurately quantifying tremor, numerous studies have proposed digital health solutions for objective and continuous evaluation in the form of wearable sensor-based techniques6,7. While effective, these methods are constrained by limited availability, user compliance issues, and scalability challenges.
Computer vision-based methods have emerged as a promising alternative for automatic tremor quantification8–13. These techniques leverage ubiquitous devices like smartphones to enable objective assessment of tremor during standardized motor tasks. Existing video-based methods primarily measure tremor frequency10–13, which can be clinically relevant but lacks direct alignment with clinical scales and can be difficult to interpret14. In contrast, video-based quantification of tremor amplitude, which relates directly to clinical scales, remains largely underexplored.
In this study, we present a novel video-based approach to estimate postural tremor amplitude and frequency directly from smartphone videos without the need for external calibration references. The approach integrates iris-based scaling, 3D pose estimation, and hand tracking models to convert pixel-based measurements into metric units, enabling analysis in unconstrained environments. Validated in healthy controls and Parkinson’s disease patients, the proposed method demonstrated strong agreement with manual measures, sensitivity to deep brain stimulation (DBS) effects, and strong correlation with clinician-provided tremor scores. This technology is now available as part of VisionMD, our open-source platform for video-based analysis of motor function in movement disorders. It provides researchers and clinicians with a practical, accessible, and scalable tool for monitoring tremor and other motor symptoms, with potential applications in clinical trials and telemedicine15.
Figure 1A illustrates the overall pipeline. First, a video of a subject performing the postural tremor task is recorded using a smartphone or other standard camera. During the task, one arm is extended forward with the palm facing down for 10 s, maintaining a straight wrist and comfortably separated fingers4. The recorded video is then processed in three steps: (1) Google MediaPipe, an off-the-shelf, state-of-the-art face landmark estimation model16, is used to estimate the diameter of the left and right irises (in pixels) and the head pose in each video frame. The average iris diameter from the frames with a neutral head pose (yaw, pitch, and roll < 10°) is used to compute the pixel-to-millimeter ratio at face depth, assuming an iris diameter of 12 mm17,18. (2) MeTRAbs, an off-the-shelf, state-of-the-art absolute 3D human pose estimation model, is used to estimate the approximate distance (in meters) from the camera to the face and from the camera to the extended hand. These distances are used to adjust the pixel-to-millimeter ratio from face depth to hand depth via the pinhole camera model. (3) A hand pose estimation model is used to localize the extended hand and estimate the positions of the ring, middle, and index fingertips in each video frame. A tremor signal is generated by tracking the average fingertip position throughout the video and applying a band-pass filter with cut-off frequencies of 2 Hz and 10 Hz to the position signal, which removes voluntary movements, drift, and high-frequency noise produced by hand pose estimation errors.
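The two scaling steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VisionMD implementation: the function names, the interpretation of the ratio as millimeters per pixel, and the example numbers are all hypothetical, and only the 12 mm iris diameter and the 10° neutral-pose limit come from the text.

```python
IRIS_DIAMETER_MM = 12.0  # average adult iris diameter assumed in the text


def is_neutral(yaw_deg, pitch_deg, roll_deg, limit_deg=10.0):
    """True when every head angle is within the neutral-pose limit."""
    return all(abs(a) < limit_deg for a in (yaw_deg, pitch_deg, roll_deg))


def mm_per_pixel_at_face(iris_diameters_px):
    """Scale (mm per pixel) at face depth from per-frame iris diameters in pixels."""
    if not iris_diameters_px:
        raise ValueError("need at least one neutral-pose frame")
    mean_px = sum(iris_diameters_px) / len(iris_diameters_px)
    return IRIS_DIAMETER_MM / mean_px


def mm_per_pixel_at_hand(scale_face, face_depth_m, hand_depth_m):
    """Pinhole model: the width covered by one pixel grows linearly with depth,
    so the face-depth scale is multiplied by hand_depth / face_depth."""
    return scale_face * hand_depth_m / face_depth_m


# Made-up example values: iris diameters (pixels) from three neutral-pose frames,
# and camera distances such as a 3D pose model might return.
iris_px = [8.0, 8.2, 7.8]
scale_face = mm_per_pixel_at_face(iris_px)
scale_hand = mm_per_pixel_at_hand(scale_face, face_depth_m=3.0, hand_depth_m=2.4)
print(f"face: {scale_face:.3f} mm/px, hand: {scale_hand:.3f} mm/px")
```

Because the extended hand is closer to the camera than the face, each pixel at hand depth covers less physical distance, so the hand-depth scale in this example is smaller than the face-depth scale.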
Fig. 1. Overview and validation of the proposed video-based tremor quantification method.
A Schematic representation of the multistep pipeline for estimating tremor amplitude and principal frequency from hand postural tremor tasks. The method employs MediaPipe for detecting facial landmarks and determining the iris-based pixel-to-millimeter calibration, MeTRAbs for estimating subject-to-camera distances, and a YOLO-based hand pose model for tracking the tips of the index, middle, and ring fingers through the video. B Validation results comparing our automatic pipeline versus manually computed measures. B.1 Scatter plot comparing estimated tremor amplitude derived from three hand pose estimation models (YOLO-Nano, YOLO-XL, and WiLoR) against manually annotated data. Intraclass correlation coefficients (ICC3) demonstrate excellent agreement (ICC > 0.99 for all models) between manually annotated and automatic results. B.2 Box and whiskers plots comparing estimated principal frequency derived from three hand pose estimation models (YOLO-Nano, YOLO-XL, and WiLoR) against manually annotated data. ANOVA demonstrated no difference between manually annotated and automatic results. C Clinical validation evaluating the sensitivity of video-derived tremor measures to DBS therapy. C.1 Change in tremor amplitude between DBS OFF and DBS ON conditions in 11 Parkinson’s disease patients, showing a significant reduction in tremor amplitude during DBS ON (paired t-test, p = 0.03). C.2 Correlation between video-based tremor amplitude estimates and clinical UPDRS tremor scores across all participants and conditions (Spearman’s ρ = 0.83, p-value ≪ 0.001, Kendall’s τ = 0.72, p-value ≪ 0.001).
We employed several off-the-shelf hand pose estimation models, including Google MediaPipe16,19, MMPose20, and OpenPose21. However, these models produced inconsistent results, often failing to localize the hands in the videos and yielding inaccurate estimates of fingertip positions. This limitation is likely related to the fact that most existing models are primarily trained to detect the palm or the dorsum of the hand19, and generalize poorly to other hand postures9,22. During the execution of the hand postural tremor task, these hand surfaces are not visible, resulting in unreliable estimations. Previous studies that successfully applied off-the-shelf models to postural tremor assessment asked the participants to bend their elbows and hold their hands close to the chest8,13, a posture inconsistent with standardized protocols4. Others required precise camera placement so that the palm or dorsum of the hand is clearly visible, which increases sensitivity to recording conditions and reduces clinical robustness9,10.
To overcome the limitations of existing hand pose estimation models, we developed a custom pipeline consisting of a hand localization model followed by a fingertip landmarks estimation model. Both models were based on the YOLO architecture23. We trained two versions of the fingertip landmarks estimation model: Nano, a lightweight model with ~3 million parameters optimized for real-time inference on consumer devices, and XL, a larger model with >60 million parameters designed to maximize accuracy at the expense of computational speed. Finally, we benchmarked the performance of our custom fingertip landmarks estimation models against WiLoR, a state-of-the-art 3D hand reconstruction framework24.
To evaluate the accuracy of our proposed approach, we collected video data from five healthy controls performing a postural tremor motor task. Videos were recorded at 60 frames per second with a resolution of 1920 × 1080 pixels using an iPhone 12. Participants were instructed to sit upright, extend the right hand toward the camera with the palm facing down, and simulate varying levels of wrist tremor amplitude including none, slight (<1 cm), mild (1–3 cm), moderate (3–10 cm), and severe (>10 cm).
The recorded videos were processed with our pipeline to extract tremor amplitude and principal frequency. The results were compared against manually annotated data derived from frame-by-frame fingertip localization. Figure 1B.1 presents a scatter plot comparing the estimated tremor amplitudes from the three hand pose estimation models against the ground truth amplitudes. Intraclass correlation coefficients (ICC) were computed to quantify agreement between the model-derived and manually estimated amplitudes. All three models demonstrated excellent agreement with ground truth measures (ICC > 0.99), with the WiLoR model showing slightly superior performance in the severe tremor range (>10 cm in amplitude).
Principal frequency estimation was similarly robust across models. Figure 1B.2 presents violin plots of the principal frequency obtained via manual estimation (4.52 ± 0.32 Hz) compared with WiLoR (4.52 ± 0.32 Hz), Nano (4.55 ± 0.37 Hz), and XL (4.51 ± 0.32 Hz) models. ANOVA indicated no significant differences between methods.
Finally, computational performance was benchmarked using an NVIDIA RTX A6000 GPU. The Nano model exhibited the fastest inference speed (20.2 ± 0.1 ms per frame), followed by the XL model (35.8 ± 0.2 ms per frame), and WiLoR being the slowest (72.1 ± 0.6 ms per frame). These results highlight a practical tradeoff: Nano is optimal for lightweight, real-time clinical deployment, while XL and WiLoR provide higher fidelity at increased computational cost.
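To put these per-frame latencies in context, the short calculation below converts them to throughput and to the time needed for one 10 s task recorded at 60 frames per second. It is only arithmetic on the figures reported above; it covers the fingertip model alone on the benchmark GPU, and the helper names are hypothetical.

```python
VIDEO_FPS = 60      # recording frame rate reported in the text
TASK_SECONDS = 10   # duration of the postural tremor task

# Mean per-frame inference times (ms) reported for the three models
latency_ms = {"Nano": 20.2, "XL": 35.8, "WiLoR": 72.1}


def frames_per_second(ms_per_frame):
    """Throughput implied by a per-frame latency."""
    return 1000.0 / ms_per_frame


def processing_seconds(ms_per_frame, n_frames):
    """Total inference time for n_frames processed one after another."""
    return ms_per_frame * n_frames / 1000.0


n_frames = VIDEO_FPS * TASK_SECONDS
for name, ms in latency_ms.items():
    print(f"{name}: {frames_per_second(ms):.1f} frames/s, "
          f"{processing_seconds(ms, n_frames):.1f} s for {n_frames} frames")
```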
To evaluate the clinical utility of our proposed approach, we analyzed videos from eleven persons with Parkinson’s disease (pwPD) undergoing stable deep brain stimulation (DBS), a widely used surgical therapy for alleviating motor symptoms in Parkinson’s disease and other neurological conditions. Participants performed the hand postural tremor task with their left and right hands, resulting in a total of 22 videos. All videos were recorded at 60 frames per second and 1920 x 1080 resolution using an iPhone 12. Each participant completed the task under two conditions: DBS turned ON and DBS turned OFF.
The videos were processed using the proposed pipeline with the Nano model only. Figure 1C.1 illustrates the changes in tremor amplitude between the DBS OFF and DBS ON conditions. A paired t-test revealed a significant reduction in tremor amplitude in the DBS ON state compared to the DBS OFF state (3.2 ± 1.2 mm vs. 7.6 ± 7.1 mm, p = 0.03), highlighting the ability of our method to detect DBS-induced improvements in tremor severity.
Additionally, we compared the video-based tremor amplitude with clinician-assigned scores derived from the Unified Parkinson’s Disease Rating Scale (UPDRS) criteria, which grades postural tremor severity on a 0–4 scale, with 0 indicating no tremor and 4 a tremor amplitude >10 cm. Figure 1C.2 depicts the correlation between the video-derived tremor amplitude and the UPDRS clinical scores across all participants and conditions. Spearman and Kendall rank correlation analyses demonstrated a strong positive association (Spearman ρ = 0.83, p < 0.001; Kendall τ = 0.72, p < 0.001), suggesting that our pipeline provides objective measures of tremor severity that align well with established clinical criteria. Interestingly, the algorithm detected small-amplitude tremor even in cases rated as 0 (no tremor) by clinicians. Careful review of the videos confirmed subtle oscillations in these participants, though such movements may not have influenced the clinician’s overall impression during the exam.
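For readers who want to relate a measured amplitude to the amplitude ranges used in this study, the sketch below bins a value in millimeters into the five levels named earlier (none, slight, mild, moderate, severe). It is an illustration only and is not part of the validated pipeline: the function name, the handling of values that fall exactly on a boundary, and the `no_tremor_mm` cut-off are assumptions, since the text does not define a threshold below which tremor is considered absent.

```python
def amplitude_category(amplitude_mm, no_tremor_mm=0.0):
    """Bin a peak-to-peak amplitude (mm) into the amplitude ranges used in the text.

    no_tremor_mm is a hypothetical cut-off for "no tremor"; the source does not
    specify one, so it defaults to zero and should be set by the user.
    """
    if amplitude_mm < 0:
        raise ValueError("amplitude must be non-negative")
    if amplitude_mm <= no_tremor_mm:
        return "none"
    if amplitude_mm < 10.0:     # below 1 cm
        return "slight"
    if amplitude_mm < 30.0:     # 1 to 3 cm
        return "mild"
    if amplitude_mm < 100.0:    # 3 to 10 cm
        return "moderate"
    return "severe"             # 10 cm and above (boundary handling is an assumption)


# Group means reported for the DBS ON and DBS OFF conditions
for label, value in (("DBS ON", 3.2), ("DBS OFF", 7.6)):
    print(label, value, "mm ->", amplitude_category(value))
```

With a zero cut-off, any non-zero oscillation is labeled at least "slight", which mirrors the observation that the algorithm reports small amplitudes in videos a clinician scored as 0.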
Our method has several limitations that warrant consideration. First, the camera must be positioned at a distance that ensures the subject’s head and upper body are fully visible while maintaining sufficient resolution to accurately capture the iris and detect subtle tremor movements. In our experiments, a distance of ~10 feet (around 3 m) between the camera and the subject provided an optimal balance. However, this distance is not fixed; the method should work reliably as long as the face and hands are clearly visible and the video resolution is high enough to capture tremor-induced displacements.
Second, the pixel-to-millimeter calibration assumes a fixed iris diameter of 12 mm, consistent with typical adult anatomical averages17,18. Natural variation in iris diameter (ranging from ~11 mm to 13 mm) may introduce calibration errors of up to 10%. In addition, the method assumes relative stability of the hand-to-camera distance throughout the recording. This assumption can be violated in cases of large-amplitude tremor involving wrist, elbow, or shoulder motion, potentially leading to under- or overestimation of tremor amplitude. These limitations are most relevant for severe tremors and should be considered when interpreting results in such cases. Moreover, patients with severe tremor who are unable to maintain the standardized UPDRS posture (arm fully extended in front of the body) cannot be reliably assessed with the proposed pipeline, as the calibration depends on a stable hand-to-camera distance.
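The size of this calibration error follows directly from the ratio of the assumed to the true iris diameter, because every millimeter value is proportional to the assumed 12 mm. The sketch below works through the two extremes quoted above; the function name is hypothetical and the calculation ignores every other error source.

```python
ASSUMED_IRIS_MM = 12.0  # fixed diameter used for calibration in the text


def amplitude_bias_percent(true_iris_mm):
    """Relative error (%) in any scaled length when the true iris diameter
    differs from the assumed one. Positive values mean overestimation."""
    return (ASSUMED_IRIS_MM / true_iris_mm - 1.0) * 100.0


for true_mm in (11.0, 12.0, 13.0):
    print(f"true iris {true_mm:.0f} mm -> bias {amplitude_bias_percent(true_mm):+.1f}%")
```

A smaller-than-assumed iris leads to overestimated amplitudes and a larger one to underestimated amplitudes, and both extremes stay within the 10% bound stated in the text.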
Third, our method relies on an absolute 3D human pose estimation model to estimate the subject-to-camera distance. These models generally require knowledge of the camera’s intrinsic parameters to produce accurate metric distance estimates. In this study, we employed intrinsic parameters that are valid for many smartphone cameras, but these may not generalize to other camera types. Incorporating device-specific calibration could improve accuracy for diverse hardware configurations. The VisionMD implementation of this method includes the option to input custom intrinsic camera parameters, which supports its use across diverse recording devices.
Fourth, the validation in this study was based on manually annotated fingertip trajectories rather than comparison with gold-standard tools such as accelerometers or motion capture (MoCap) systems. While accelerometers and MoCap can detect sub-millimeter displacements, such precision is beyond the spatial resolution of video-based systems, which are inherently limited by the pixel-to-millimeter ratio. This ratio is in turn constrained by multiple factors, including the image resolution and the distance from the subject to the camera. Thus, we argue that manual annotation represents a valid reference for this application, as it provides measures aligned with the actual resolution limits of video-based tremor quantification. Importantly, our results demonstrate that the proposed method can robustly detect and quantify sub-centimeter tremor amplitudes, which are clinically relevant for the assessment of tremor severity. Future work should include validation against sensor-based ground truth in larger and more diverse cohorts.
Finally, our clinical validation results are limited by several factors: the small sample size, the inclusion of only participants with Parkinson’s disease, the reliance on a single clinical rater, which may introduce inter- and intra-rater bias, and the fact that most participants received clinical scores ≤ 2. These constraints limit the generalizability of our findings to broader tremor syndromes such as essential tremor or dystonia and restrict evaluation across the full spectrum of tremor severity. Future studies should include larger and more diverse cohorts, with multiple clinical groups and blinded raters, to strengthen validation and assess robustness across distinct tremor presentations. Despite these limitations, our findings demonstrate that the proposed method provides clinically valuable, objective tremor quantification with strong potential for integration into research and clinical practice.
Methods
Participants and data collection
Five healthy control participants (5 males; age range: 22–41 years) and eleven persons with Parkinson’s disease (pwPD) under stable deep brain stimulation (DBS) therapy were recruited for this study. The demographic information for the pwPD has been described before15. Inclusion criteria for pwPD included a confirmed diagnosis of Parkinson’s disease based on Movement Disorder Society criteria and a stable DBS program. This research project was performed in accordance with the Declaration of Helsinki. All participants provided written informed consent under a protocol approved by the University of Florida Institutional Review Board.
For the healthy controls, videos were recorded to evaluate the accuracy of the proposed pipeline. Participants were seated ~10 feet (3 meters) from the camera in a well-lit room and were instructed to extend their right arm forward with the palm facing downward while simulating five tremor levels: (1) no tremor, (2) slight tremor (<1 cm amplitude), (3) mild tremor (1–3 cm amplitude), (4) moderate tremor (3–10 cm amplitude), and (5) severe tremor (>10 cm amplitude). Each condition was demonstrated prior to recording to standardize participant performance. During the recording, a printout of a checkerboard pattern with squares measuring 50 mm was placed next to the subject’s hand. Videos were recorded at 60 frames per second (fps) with a resolution of 1920 x 1080 pixels using an iPhone 12 mounted on a tripod.
For pwPD, participants performed the hand postural tremor task under a DBS ON/OFF protocol. Videos were first recorded with DBS turned ON. The stimulator was then turned OFF, and after a 30 min adaptation period to allow symptom reemergence, the task was repeated. Each participant performed the task with both left and right hands separately. A trained rater, non-blinded to the patient condition or DBS status, provided a score on a 0–4 scale following the Unified Parkinson’s Disease Rating Scale (UPDRS) criteria. Videos were recorded at 60 frames per second (fps) with a resolution of 1920 x 1080 pixels using an iPhone 12 mounted on a tripod. All recordings were performed in a clinical examination room at the Norman Fixel Institute for Neurological Disorders, part of UF Health, Gainesville, FL, USA.
Automatic video processing pipeline
The proposed pipeline consists of three main stages:
- Iris-based pixel-to-millimeter calibration: Facial landmarks and the left and right irises were detected using Google’s MediaPipe (v 0.10.21)16. For each video frame, the head pose was computed from the facial landmarks, and frames with neutral head orientation (yaw, pitch, roll < 10°) were selected for calibration. The average iris diameter over the frames with neutral head orientation, $\bar{d}_{\text{iris}}$ (in pixels), was used to compute the pixel-to-millimeter ratio at face depth ($r_{\text{face}}$), assuming an iris diameter of 12 mm:

$$r_{\text{face}} = \frac{12\ \text{mm}}{\bar{d}_{\text{iris}}} \qquad (1)$$

- Depth adjustment using 3D human pose estimation: Absolute distances from the camera to the participant’s face ($Z_{\text{face}}$) and hand ($Z_{\text{hand}}$) were estimated using MeTRAbs (Metric-Transformer for Absolute 3D Human Pose Estimation, v1.0)25. The pixel-to-millimeter ratio was adjusted from face depth to hand depth ($r_{\text{hand}}$) using the pinhole camera model:

$$r_{\text{hand}} = r_{\text{face}}\,\frac{Z_{\text{hand}}}{Z_{\text{face}}} \qquad (2)$$

- Hand pose estimation and tremor analysis: Hand localization was performed using a YOLOv11 object detection model trained on a custom annotated hand dataset. Once hands were localized, we employed a hand pose estimation model to detect the position of the ring, middle, and index fingertips. For this, we compared the WiLoR model, a state-of-the-art 3D hand pose estimation model, and our custom YOLOv11 pose estimation models. Two model sizes were used: Nano (3 M parameters) for fast processing and XL (60 M parameters) for higher accuracy.
Fingertip trajectories were tracked across video frames. The average x and y displacement signals were band-pass filtered (2–10 Hz) using a 2nd-order Butterworth filter to isolate tremor-related motion. Tremor amplitude was computed as the maximum peak-to-peak displacement, scaled to millimeters using the pixel-to-millimeter ratio at hand depth. Principal frequency was determined from the power spectral density (Welch’s method, segment length: 256 samples, overlap: 128 samples).
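The filtering and spectral steps can be sketched with NumPy and SciPy as follows. This is a minimal sketch under several assumptions that the text does not settle: zero-phase filtering with `sosfiltfilt`, taking the larger of the x and y peak-to-peak ranges as the amplitude, and locating the principal frequency on the summed x and y spectra. The function name `tremor_metrics` and the synthetic input are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, welch


def tremor_metrics(x_px, y_px, fps, mm_per_px, band_hz=(2.0, 10.0)):
    """Return (amplitude in mm, principal frequency in Hz) from fingertip traces."""
    x = np.asarray(x_px, dtype=float)
    y = np.asarray(y_px, dtype=float)

    # Butterworth band-pass designed with order 2 (SciPy doubles the order for
    # band-pass designs), applied forward and backward for zero phase shift.
    sos = butter(2, band_hz, btype="bandpass", fs=fps, output="sos")
    x_f = sosfiltfilt(sos, x)
    y_f = sosfiltfilt(sos, y)

    # Assumption: amplitude is the larger of the two per-axis peak-to-peak ranges.
    amplitude_px = max(np.ptp(x_f), np.ptp(y_f))

    # Welch PSD with 256-sample segments and 50% overlap, as stated in the text;
    # shorter recordings fall back to a single shorter segment.
    nperseg = min(256, len(x_f))
    freqs, psd_x = welch(x_f, fs=fps, nperseg=nperseg, noverlap=nperseg // 2)
    _, psd_y = welch(y_f, fs=fps, nperseg=nperseg, noverlap=nperseg // 2)
    principal_hz = float(freqs[np.argmax(psd_x + psd_y)])

    return float(amplitude_px * mm_per_px), principal_hz


# Synthetic example: a 5 Hz oscillation of 10 px plus slow drift, 10 s at 60 fps
fps = 60
t = np.arange(10 * fps) / fps
x = 10.0 * np.sin(2 * np.pi * 5.0 * t) + 1.0 * t
y = np.zeros_like(t)
amp_mm, f_hz = tremor_metrics(x, y, fps, mm_per_px=0.5)
print(f"amplitude ~ {amp_mm:.1f} mm, principal frequency ~ {f_hz:.2f} Hz")
```

At 60 frames per second, a 256-sample segment gives a frequency resolution of about 0.23 Hz, so the reported principal frequency is quantized to that grid.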
Model training
The custom hand localization and hand pose estimation models employed in this study were trained by fine-tuning existing models pretrained on the COCO dataset26. For training, we assembled a dataset of 1,800 images extracted from videos of individuals performing the hand pronation/supination task. Similar to the postural tremor tasks, this task requires participants to sit with the arm extended in front of their body. However, instead of maintaining a static posture, participants repeatedly rotate the palms upward and downward for 10 s, generating a dataset with substantial variability in hand position and orientation. All frames were manually annotated with bounding boxes around the hands and fingertip coordinates using Roblox video annotation software.
Data augmentation techniques, including random rotations (±15°), horizontal flipping, scaling (±10%), brightness/contrast adjustments, and Gaussian noise addition, were applied to improve model robustness to varying recording conditions. The dataset was split into 80% for training and 20% for validation.
The models were implemented using the PyTorch framework (v2.2) and trained for 100 epochs using the Adam optimizer with a variable learning rate and early stopping based on validation loss to prevent overfitting. Training was performed on an NVIDIA RTX A6000 GPU.
The WiLoR model was used as published, without additional training, to serve as a benchmark for comparison.
Validation against manual measures
For accuracy evaluation, healthy control videos were manually annotated frame-by-frame by a trained observer using Roblox video annotation software. The positions of the index, middle, and ring fingertips were manually identified in each video frame, and a reference tremor signal was generated by averaging the fingertip trajectories. Tremor amplitude (in pixels) was derived from the signal’s range, and a checkerboard pattern positioned next to the hand served to transform tremor amplitude from pixels to mm. The signal’s principal frequency was also estimated following the procedure described above.
To assess agreement, tremor amplitude estimates obtained from the automatic pipelines were compared with manually derived measures using the intraclass correlation coefficient (ICC 3,1 – fixed raters), which quantifies both consistency and absolute agreement between methods. In addition, principal frequency estimates were statistically compared across methods using analysis of variance (ANOVA) to evaluate whether significant differences existed between manual and model-derived results.
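For reference, the single-measure, fixed-rater coefficient ICC(3,1) in the Shrout and Fleiss formulation is computed from a two-way ANOVA as (MS_rows − MS_error) / (MS_rows + (k − 1) MS_error), where rows are the measured videos and the k columns are the methods being compared. The sketch below implements that formula in plain Python; the study itself used the Pingouin library, and the function name and example numbers here are made up for illustration.

```python
def icc_3_1(ratings):
    """ICC(3,1) for a table with one row per target and one column per method."""
    n = len(ratings)           # number of targets (e.g., videos)
    k = len(ratings[0])        # number of methods (e.g., manual and automatic)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Made-up amplitudes (mm): manual reference vs. one automatic model
table = [[2.1, 2.3], [8.4, 8.0], [21.0, 20.2], [55.3, 56.1], [118.0, 115.5]]
print(f"ICC(3,1) = {icc_3_1(table):.3f}")
```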
Statistical analysis
Paired t-tests were used to assess differences in tremor amplitude between DBS ON and OFF conditions in pwPD. Spearman’s and Kendall’s rank correlation coefficients were computed to evaluate associations between video-based tremor amplitudes and clinician-assigned scores (0–4 scale). A significance threshold of p < 0.05 was applied. Statistical analyses were performed in Python using the Pingouin library27.
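The same three tests can be reproduced with SciPy, as sketched below. SciPy is used here in place of Pingouin purely for illustration, and all numbers are made up; they are not the study data.

```python
from scipy.stats import kendalltau, spearmanr, ttest_rel

# Made-up tremor amplitudes (mm) for the same six hands in both conditions
off_mm = [5.1, 9.8, 3.9, 14.2, 6.5, 4.4]
on_mm = [3.0, 4.1, 2.8, 5.0, 3.5, 3.1]

# Paired t-test: each hand serves as its own control
t_res = ttest_rel(off_mm, on_mm)
print(f"paired t = {t_res.statistic:.2f}, p = {t_res.pvalue:.3f}")

# Made-up clinician scores (0-4) for all twelve recordings, OFF then ON
amplitudes = off_mm + on_mm
scores = [1, 2, 1, 2, 1, 1, 0, 1, 0, 1, 0, 0]

# Rank correlations tolerate the ordinal, heavily tied clinical scores
rho, p_rho = spearmanr(amplitudes, scores)
tau, p_tau = kendalltau(amplitudes, scores)
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.4f})")
print(f"Kendall tau = {tau:.2f} (p = {p_tau:.4f})")
```

Rank-based coefficients are a natural choice here because the clinical score is ordinal and maps non-linearly onto amplitude in millimeters.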
Acknowledgements
D.L.G. received funding from the Norman Fixel Institute for Neurological Diseases and the University of Florida Intelligent Clinical Care Center.
Author contributions
D.L.G.: Writing – review and editing, Writing – original draft, Visualization, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on request. The underlying code and models used for video processing and described in this study are available as part of VisionMD (www.VisionMD.ai).
Competing interests
The author declares no competing interests.
References
- 1. Bailey, G. A., Rawlings, A., Torabi, F., Pickrell, O. & Peall, K. J. Adult-onset idiopathic dystonia: a national data-linkage study to determine epidemiological, social deprivation, and mortality characteristics. Eur. J. Neurol. 29, 91–104 (2022).
- 2. Louis, E. D. & Ottman, R. How many people in the USA have essential tremor? Deriving a population estimate based on epidemiological data. Tremor Hyperkinetic Mov. 4, 259 (2014).
- 3. Marras, C. et al. Prevalence of Parkinson’s disease across North America. Npj Park. Dis. 4, 21 (2018).
- 4. Goetz, C. G. et al. Movement disorder society-sponsored revision of the unified Parkinson’s disease rating scale (MDS-UPDRS): scale presentation and clinimetric testing results: MDS-UPDRS: clinimetric assessment. Mov. Disord. 23, 2129–2170 (2008).
- 5. Majersik, J. J. et al. A Shortage of Neurologists – We Must Act Now: A Report From the AAN 2019 Transforming Leaders Program. Neurology 96, 1122–1134 (2021).
- 6. Vescio, B., Quattrone, A., Nisticò, R., Crasà, M. & Quattrone, A. Wearable Devices for Assessment of Tremor. Front. Neurol. 12, 680011 (2021).
- 7. Lu, R. et al. Evaluation of wearable sensor devices in parkinson’s disease: a review of current status and future prospects. Park. Dis. 2020, 4693019 (2020).
- 8. Friedrich, M. U. et al. Validation and application of computer vision algorithms for video-based tremor analysis. Npj Digit. Med. 7, 165 (2024).
- 9. Wolke, R., Welzel, J., Maetzler, W., Deuschl, G. & Becktepe, J. Validity of tremor analysis using smartphone compatible computer vision frameworks. Sci. Rep. 15, 13391 (2025).
- 10. Güney, G. et al. Video-based hand movement analysis of parkinson patients before and after medication using high-frame-rate videos and mediapipe. Sensors 22, 7992 (2022).
- 11. Williams, S. et al. Accuracy of smartphone video for contactless measurement of hand tremor frequency. Mov. Disord. Clin. Pract. 8, 69–75 (2021).
- 12. Fois, A. F., Mahant, N., Vucic, S. & Fung, V. S. C. Measuring tremor—a comparison of automated video analysis, neurophysiology, and clinical rating. Mov. Disord. 36, 2962–2963 (2021).
- 13. Pintea, S. L. et al. Hand-tremor frequency estimation in videos. In Computer Vision – ECCV 2018 Workshops: Munich, Germany, September 8-14, 2018, Proceedings, Part VI, 213–228 (Springer-Verlag, Berlin, Heidelberg).
- 14. Casamento-Moran, A. et al. Quantitative separation of tremor and Ataxia in essential tremor. Ann. Neurol. 88, 375–387 (2020).
- 15. Acevedo, G. et al. VisionMD: an open-source tool for video-based analysis of motor function in movement disorders. Npj Park. Dis. 11, 1–5 (2025).
- 16. Lugaresi, C. et al. MediaPipe: A framework for perceiving and processing reality. In Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR 2019) 4 (IEEE, 2019).
- 17. Shoch, D. E. Histology of the human eye: an Atlas and textbook. JAMA 219, 221 (1972).
- 18. Forrester, J. V., Dick, A. D., McMenamin, P. G., Roberts, F. & Pearlman, E. Chapter 1 - Anatomy of the eye and orbit. In The Eye (Fourth Edition) (eds. Forrester, J. V., Dick, A. D., McMenamin, P. G., Roberts, F. & Pearlman, E.) 1-102.e2 (W.B. Saunders, 2016).
- 19. Zhang, F. et al. MediaPipe hands: on-device real-time hand tracking. arXiv 10.48550/arXiv.2006.10214 (2020).
- 20. Chen, K. et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv 10.48550/arXiv.1906.07155 (2019).
- 21. Cao, Z., Simon, T., Wei, S.-E. & Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. in 7291–7299 (2017).
- 22. Zimmermann, C. et al. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 813–822 (IEEE, 2019).
- 23. Terven, J., Córdova-Esparza, D.-M. & Romero-González, J.-A. A comprehensive review of YOLO architectures in computer vision: from YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 5, 1680–1716 (2023).
- 24. Potamias, R. A., Zhang, J., Deng, J. & Zafeiriou, S. WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 12242–12254 (IEEE, 2025).
- 25. Sárándi, I., Hermans, A. & Leibe, B. Learning 3D Human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2955–2965 (IEEE, 2023).
- 26. Lin, T.-Y. et al. Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014 (eds. Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer International Publishing, Cham, 2014).
- 27. Vallat, R. Pingouin: statistics in Python. J. Open Source Softw. 3, 1026 (2018).

