Abstract
Despite advancements in rehabilitation protocols, clinical assessment of upper extremity (UE) function after stroke largely remains subjective, relying heavily on therapist observation and coarse scoring systems. This subjectivity limits the sensitivity of assessments to detect subtle motor improvements, which are critical for personalized rehabilitation planning. Recent progress in computer vision offers promising avenues for enabling objective, quantitative, and scalable assessment of UE motor function. Among standardized tests, the Box and Block Test (BBT) is widely utilized for measuring gross manual dexterity and tracking stroke recovery, providing a structured setting that lends itself well to computational analysis. However, existing datasets targeting stroke rehabilitation primarily focus on daily living activities and often fail to capture clinically structured assessments such as block transfer tasks. Furthermore, many available datasets include a mixture of healthy and stroke-affected individuals, limiting their specificity and clinical utility. To address these critical gaps, we introduce StrokeVision-Bench, the first-ever dedicated dataset of stroke patients performing clinically structured block transfer tasks. StrokeVision-Bench comprises 1,000 annotated videos categorized into four clinically meaningful action classes, with each sample represented in two modalities: raw video frames and 2D skeletal keypoints. We benchmark several state-of-the-art video action recognition and skeleton-based action classification methods to establish performance baselines for this domain and facilitate future research in automated stroke rehabilitation assessment.
1. INTRODUCTION
Stroke is the leading cause of serious chronic physical disability in the United States, impacting millions of individuals each year [34]. Among stroke survivors, approximately 95% experience upper extremity (UE) dysfunction [7], with 30 to 66% exhibiting a significantly impaired ability to use the affected arm [35]. Standardized routine Outcome Measures (OMs) of UE impairment are critical for informing clinical decisions about rehabilitation protocols and for tracking the progression of sensorimotor deficits [25, 30]. However, commonly used OMs often fail to provide sufficient evidence to help therapists develop and adapt personalized care plans. This is because existing assessment tools rely on subjective grading of movement quality using a few discrete levels (e.g., Action Research Arm Test) [2], or the number of blocks transferred during a fixed interval [29]. As a result, these OMs lack the sensitivity needed to detect behavioural changes or to directly target specific functional deficits.
To address these shortcomings, considerable research has focused on developing wearable assessment tools that measure hand and arm kinematics [22]. However, these technologies are not widely adopted in clinical settings due to their high cost and the need for specialized training for clinical staff. To overcome these challenges, we aim to develop a solution based entirely on computer vision techniques that is affordable and easily deployable in real-world clinical environments. Our goal is a computer vision–based diagnostic system that runs on mobile devices, enabling objective progress tracking without costly equipment or clinical expertise [20].
Video action understanding remains a fundamental and challenging problem in computer vision, requiring models to recognize and classify complex human actions from dynamic visual content [11, 31]. It demands learning rich spatio-temporal features from frame sequences, capturing human interactions, object dynamics, and overall context [26, 27]. Transformer-based vision encoders (ViTs) [4, 21] have emerged as the gold standard for modeling spatio-temporal information and capturing global contextual relationships in visual data. Building on this foundation, Swin Transformers [15] enhance the standard self-attention mechanism with a hierarchical shifted-window approach, enabling more efficient and scalable learning of spatial representations. In the clinical setting, 2D skeletons serve two purposes: (1) they provide rich information about the mobility of different body parts, and (2) they do not reveal privacy attributes of the patients [36]. We build our benchmarking setup upon these advances in skeleton-based representations and evaluate MotionBERT [40], PoseConv3D [5], and MS-G3D [17] on our proposed StrokeVision-Bench.
Despite the strong performance of 2D skeleton action classification methods, their effectiveness and efficiency on medical datasets remain open questions. Building on this motivation, we introduce StrokeVision-Bench, a clinical video dataset consisting of human-centric recordings. These videos capture patients performing the task of transferring blocks from one location to another, recorded both before and after the clinical sessions. The individuals featured in the dataset suffer from upper extremity (UE) dysfunction and regularly attend rehabilitation sessions for weekly physical treatment aimed at improving the mobility of their body parts. To support this objective, it is necessary to monitor each individual's progress across multiple weeks. Two types of information are particularly valuable for clinicians: (1) the total number of objects transferred by the individual before and after the sessions, and (2) the joint angle between the shoulder and abdomen, which provides insight into improvements in body movement. We benchmark existing state-of-the-art video action and 2D skeleton-based classification methods in this study. Our dataset is fully annotated: each video is labelled with one of four categories: (i) Grasping, (ii) Non-task Movement, (iii) Transport with Object, and (iv) Transport without Object.
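The joint angle mentioned above can be estimated from 2D keypoints with basic vector geometry. The sketch below is illustrative only; the specific keypoints (elbow, shoulder, hip) are placeholder inputs, not the dataset's actual annotation schema:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at vertex b, formed by the segments b->a and b->c."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point values slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Example: elbow at (1, 0), shoulder at the origin, hip at (0, 1)
angle = joint_angle((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))  # 90 degrees
```

Tracking this angle per frame across pre- and post-session recordings would yield the mobility signal described in the text.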
To summarize, we make the following key contributions in this work:
We introduce StrokeVision-Bench, the first-ever dataset of 1000 videos covering four subactions of the clinical Box and Block Test for computer vision–based analysis. Each video includes both raw RGB frames and corresponding 2D skeleton poses, computed using Sapiens [13], enabling fine-grained analysis of patient movement. By including recordings before and after the sessions, our dataset captures changes in movement speed across different body parts, which is essential for assessing rehabilitation progress and for training models to detect subtle improvements in mobility.
StrokeVision-Bench is the first stroke rehabilitation benchmarking dataset that focuses exclusively on stroke patients, filling a gap in existing work and enabling the systematic evaluation of different methods for tracking patient progress.
We evaluate four video action classification methods, namely R3D [33], R2Plus1D [33], Video MViT [14], and Video Swin Transformer [16]. Additionally, we evaluate 2D skeleton-based action classification methods, including MotionBERT [40], PoseConv3D [5], and MS-G3D [17]. We expect that this dataset will further advance research in automated stroke rehabilitation assessment.
2. RELATED WORK
Video-based action recognition has shown remarkable success in modeling complex human activities in everyday environments [18, 19, 27]. Translating these techniques to medical domains, particularly stroke rehabilitation, demands curated datasets and models capable of handling clinically meaningful movements and patient variability.
Expensive and training-intensive methods:
Existing approaches are costly and require specialized clinician training. Previous methods [38, 39, 10, 8, 1, 3] have developed technologies to measure hand and arm kinematics in individuals with sensorimotor impairments. However, these methods rely on multiple wearable sensors or expensive motion capture systems, making them costly and time-consuming to deploy. They also require customized equipment designed for laboratory-based tasks that are unfamiliar to clinicians [1, 9, 6, 37], further increasing the need for specialized training. In many cases, the complexity of setup and interpretation limits their adoption outside of controlled research environments. Moreover, such systems often fail to generalize well to in-home or community-based settings where real-world rehabilitation occurs. To overcome these challenges, we propose a new approach based on video action classification that uses low-cost cameras to track and record patient mobility, removing the dependency on wearable devices.
Existing video action classification datasets for stroke:
Existing video action classification datasets for stroke rehabilitation are limited. StrokeRehab [12] offers large-scale recordings of daily activities, such as brushing and combing, collected from 20 healthy individuals and 31 stroke-impaired patients. Another study [23] uses videos of healthy subjects to identify and quantify movement abnormalities, but models trained exclusively on healthy data suffer substantial performance degradation when applied to stroke-impaired videos. By contrast, our StrokeVision-Bench centers on a standardized clinical task: we collect video of each patient performing the Box and Block Test both before and after rehabilitation sessions, thereby enabling direct evaluation of changes in motor function.
Existing video and 2D skeleton action classification methods:
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have both been widely applied to video action recognition [28]. Duan et al. [5] proposed spatio-temporal CNN architectures, and Li et al. [14] together with Liu et al. [15] introduced transformer-based models that capture long-range dependencies across frames. For skeleton-based classification, MotionBERT [40] uses 2D joint sequences but relies on accurate pose estimates that may degrade in clinical recordings, while PoseConv3D [5] attains high accuracy on controlled benchmarks but has not been evaluated on patient videos. MS-G3D [17] achieves strong results on standard action datasets but depends on dense annotations that are costly to obtain in medical settings.
3. STROKEVISION BENCHMARK
3.1. Curation of StrokeVision
We collected videos of patients, where each patient is performing the task of moving a block from one box to another. We curated videos recorded both pre- and post-session, enabling evaluation of the network’s ability to understand motion differences. Ideally, a patient demonstrates improved mobility after the session, resulting in a noticeable difference in motion speed between the two recordings. Each video was manually annotated into four action classes. StrokeVision includes a total of 1000 videos, each with a duration of 1 second and containing 30 frames. Curating an equal number of videos of distinct actions is challenging because stroke-impaired patients often struggle to transfer the block between boxes, resulting predominantly in grasping or non-task movements (Fig. 2b). Additionally, we created train-test splits while ensuring there was no data leakage between them. For more details regarding the number of videos in each action class, refer to Fig. 2b, and for visualizations of our videos in StrokeVision, refer to Fig. 2a.
Figure 2: Overview of StrokeVision-Bench: (a) Visualization of raw RGB frames alongside overlaid 2D skeleton keypoints, annotated with their corresponding action classes; (b) Train and validation sample counts for each action category in StrokeVision-Bench.
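A minimal PyTorch-style wrapper for clips of this shape might look as follows. The class names come from the paper, but the loading logic is a hypothetical sketch (real use would decode video files rather than take pre-built tensors):

```python
import torch
from torch.utils.data import Dataset

CLASSES = ["Grasping", "Non-task Movement",
           "Transport with Object", "Transport without Object"]

class StrokeVisionClips(Dataset):
    """Each sample: a 30-frame clip of shape (T, C, H, W) and one of four labels."""
    def __init__(self, clips, labels):
        self.clips = clips  # tensor of shape (N, 30, 3, H, W)
        self.targets = torch.tensor([CLASSES.index(name) for name in labels])

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        return self.clips[idx], self.targets[idx]

# Tiny synthetic example: two 30-frame clips at 64x64 resolution
data = StrokeVisionClips(torch.zeros(2, 30, 3, 64, 64),
                         ["Grasping", "Transport with Object"])
clip, label = data[1]
```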
3.2. Benchmarking Setup
As detailed in Section 1, our goal is to compare state-of-the-art video action and 2D skeleton-based recognition methods. We denote an input video as $V \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ is the frame count, $C$ the number of channels, and $H$ and $W$ the frame height and width. Each video is classified into one of four action categories using either raw frames or 2D skeleton keypoints. To obtain 2D skeleton data, we apply Sapiens [13], a state-of-the-art model by Meta, yielding $S \in \mathbb{R}^{T \times K \times 2}$, where $K$ is the number of keypoints and the last dimension represents the $(x, y)$ coordinates.
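In code, the two modalities reduce to fixed-shape tensors. In this sketch, the spatial resolution and the keypoint count K are placeholders (the paper does not specify the exact Sapiens keypoint layout):

```python
import torch

T, C, H, W = 30, 3, 224, 224  # 1-second clip at 30 fps; H, W are illustrative
K = 17                        # placeholder keypoint count, not the actual Sapiens layout

video = torch.randn(T, C, H, W)   # raw-frame modality V
skeleton = torch.randn(T, K, 2)   # 2D-skeleton modality S, one (x, y) pair per joint
```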
4. EXPERIMENTS
Models:
We examined multiple CNN and transformer-based networks for action classification using raw frames. For CNNs, we evaluated the residual network architectures introduced in [33]: 3D ResNet (referred to as R3D) and ResNet with (2+1)D convolutions (referred to as R(2+1)D). For vision transformers, we selected Video MViT [14] and Video Swin Transformer [16]. For 2D keypoint-based methods, we evaluated MotionBERT [40], PoseConv3D [5], and MS-G3D [17]. We initialized all frame-based networks with weights pretrained on the Kinetics-400 dataset. For 2D skeleton-based models, we used weights pretrained on NTU RGB+D, with the exception of MS-G3D, whose pretraining employed a different set of skeleton keypoints.
Experimental Setup:
We evaluated several models finetuned on our proposed StrokeVision-Bench using accuracy as the evaluation metric. All our experiments are implemented in PyTorch [24] and executed on a single NVIDIA V100 GPU with 32 GB of VRAM.
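The reported metric is plain top-1 accuracy, summarized over repeated runs as mean ± standard deviation; a minimal helper (with synthetic predictions for illustration):

```python
import statistics

def top1_accuracy(preds, targets):
    """Percentage of predicted labels matching the ground-truth labels."""
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return 100.0 * correct / len(targets)

# Accuracy from two hypothetical runs, reported as mean +/- standard deviation
runs = [top1_accuracy([0, 1, 2, 3], [0, 1, 2, 0]),   # 3 of 4 correct -> 75.0
        top1_accuracy([0, 1, 2, 3], [0, 1, 2, 3])]   # 4 of 4 correct -> 100.0
mean, std = statistics.mean(runs), statistics.stdev(runs)
```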
5. BENCHMARKING ON STROKEVISION-BENCH
Overall Results:
We evaluated four video-based models on raw frames and three skeleton-based methods on StrokeVision-Bench, reporting the results in Table 1. First, convolutional networks outperform vision transformers on our small and challenging dataset. For instance, accuracy falls from 87.7% for R3D [33] to 78.9% for the Video Swin Transformer [16], reflecting the transformers' greater data requirements for learning effective global representations. Second, among skeleton-based approaches, MotionBERT [40] achieves the highest accuracy at 84.3%.
Table 1:
Accuracy (mean ± standard deviation) of different models on StrokeVision-Bench. Frame-based methods (R3D, R2Plus1D, Video MViT, Video Swin Transformer) are pretrained on Kinetics-400, while skeleton-based methods (MotionBERT, PoseConv3D, MS-G3D) use NTU RGB+D or no pretraining.
| Model | Pretrained Weights | Modality | Accuracy (%) |
|---|---|---|---|
| R3D [33] | Kinetics-400 | Frames | 87.68 ± 2.48 |
| R2Plus1D [33] | Kinetics-400 | Frames | 86.96 ± 1.20 |
| Video MViT [14] | Kinetics-400 | Frames | 77.14 ± 2.15 |
| Video Swin Transformer [16] | Kinetics-400 | Frames | 78.93 ± 2.15 |
| MotionBERT [40] | NTU RGB+D | 2D Joints | 84.29 ± 1.35 |
| PoseConv3D [5] | NTU RGB+D | 2D Joints | 68.93 ± 3.06 |
| MS-G3D [17] | None | 2D Joints | 78.39 ± 1.72 |
Raw frames or 2D Skeleton?
We analyze the performance of the two modalities, raw video frames and 2D skeletons, on StrokeVision-Bench. Overall, the R3D [33] model achieves the highest accuracy at 87.7%, outperforming MotionBERT's [40] 84.3%. For clinical applications focused solely on quantifying block transfers before and after treatment, frame-based CNNs such as R3D are the best choice. If the clinic also wishes to assess improvements in patients' overall mobility, skeleton-based methods such as MotionBERT are preferable, since they deliver nearly equivalent accuracy while providing joint-level movement data that can be used to evaluate mobility improvements.
Vision Transformers vs. 2D Skeleton Methods:
Our experiments confirm that vision transformers require large datasets to achieve strong performance, as demonstrated in DeiT [32]. In clinical settings, where data curation is challenging, transformers are therefore a suboptimal choice. In contrast, 2D skeleton–based methods learn robust representations even from limited data and consistently outperform vision transformers. Furthermore, skeleton–based approaches preserve patient privacy attributes and provide detailed, joint-level mobility information.
Confusion Matrices Analysis:
Across all seven models, “Non-task movement” is easy to recognize, with most architectures exceeding 85% accuracy, while “Transport with block” versus “Transport without block” remains challenging (Fig. 3). The 3D CNNs R3D and R2Plus1D show uniform performance: approximately 92% on Non-task movement, 89% on Grasping, and 86–93% on the transport actions. The attention-based models Video MViT and Video Swin Transformer struggle to separate the two transport variants (only 57.1% and 67.9% correct on “without block”). The 2D skeleton-based networks take a middle ground: MotionBERT achieves CNN-level accuracy (87.1% Non-task, 80.6% Grasping, 84.4% and 90.0% on the transports), PoseConv3D boosts Grasping (80.6%) but struggles on Non-task (74.2%) and both transports (63–66%), and MS-G3D offers the most balanced pose-driven performance (92.9% Grasping, 86.2% with-block, 71.4% without-block) while still finding “transport without block” the most difficult case.
Figure 3: Confusion matrices for seven models (R3D, R2Plus1D, Video MViT, Video Swin, MotionBERT, PoseConv3D, MS-G3D) on the four StrokeVision-Bench actions. Most models excel on “Non-task movement” and “Grasping” but show varying confusion between “Transport with block” and “Transport without block.”
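The per-class percentages quoted in this analysis correspond to row-normalized confusion matrices, which can be computed as follows (shown here on synthetic labels, not the dataset's actual predictions):

```python
import numpy as np

def row_normalized_confusion(y_true, y_pred, n_classes=4):
    """Counts of (true, predicted) pairs, row-normalized to per-class percentages."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)

# Synthetic example: class 3 ("Transport without Object") is confused
# with class 2 ("Transport with Object") half of the time.
cm = row_normalized_confusion([0, 1, 2, 3, 3, 3, 3],
                              [0, 1, 2, 2, 3, 2, 3])
```

Each row then sums to 100%, and the diagonal entries are the per-class accuracies reported above.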
6. LIMITATIONS
While StrokeVision-Bench represents a significant advance in benchmarking action recognition methods for stroke patients, it has several limitations. First, the dataset includes only the Box and Block Test, leaving other clinical assessments of patient mobility unexamined. Second, the current dataset size is limited and would benefit from expansion through partnerships with clinical centers.
7. CONCLUSION
In this paper, we introduced StrokeVision-Bench, the first dataset of 1000 videos documenting stroke-impaired patients performing the standard Box and Block Test. By carefully curating recordings made before and after rehabilitation sessions, we capture the changes in motion speed that indicate functional improvement. The dataset includes both raw-frame and 2D-skeleton modalities, allowing evaluation of convolutional neural networks, vision transformers, and skeleton-based methods. Despite the dataset's small size and inherent challenges, CNN-based models achieve the highest overall accuracy, while skeleton-based approaches perform comparably, preserving patient privacy and providing detailed insights into joint-level mobility.
Figure 1: Overview of StrokeVision-Bench. We collect 1K short videos of stroke patients performing the block-transfer test and extract two modalities: (top) raw RGB frames and (bottom) 2D skeletal joint trajectories. Each modality is fed into a dedicated encoder, either a video-based model (e.g., CNN or Vision Transformer) or a skeleton-based network, for action classification.
8. REFERENCES
- [1]. Balasubramanian Sivakumar, Wei Ruihua, Herman Richard, and He Jiping. Robot-measured performance metrics in stroke rehabilitation. In 2009 ICME International Conference on Complex Medical Engineering, pages 1–6. IEEE, 2009.
- [2]. Brunner Iris C, Andrinopoulou Eleni-Rosalina, Selles Ruud, Lundquist Camilla Biering, and Pedersen Asger Roer. External validation of a dynamic prediction model for upper limb function after stroke. Archives of Rehabilitation Research and Clinical Translation, 6(1):100315, 2024.
- [3]. Chen Shuya, Lewthwaite Rebecca, Schweighofer Nicolas, and Winstein Carolee J. Discriminant validity of a new measure of self-efficacy for reaching movements after stroke-induced hemiparesis. Journal of Hand Therapy, 26(2):116–123, 2013.
- [4]. Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [5]. Duan Haodong, Zhao Yue, Chen Kai, Lin Dahua, and Dai Bo. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2969–2978, 2022.
- [6]. Dukelow Sean P, Herter Troy M, Moore Kimberly D, Demers Mary Jo, Glasgow Janice I, Bagg Stephen D, Norman Kathleen E, and Scott Stephen H. Quantitative assessment of limb position sense following stroke. Neurorehabilitation and Neural Repair, 24(2):178–187, 2010.
- [7]. Gowland Carolyn, DeBruin Hubert, Basmajian John V, Plews Nancy, and Burcea Ion. Agonist and antagonist activity during voluntary upper-limb movement in patients with stroke. Physical Therapy, 72(9):624–633, 1992.
- [8]. Gupta Animesh, Hasan Irtiza, Prasad Dilip K, and Gupta Deepak K. Data-efficient training of cnns and transformers with coresets: A stability perspective. arXiv preprint arXiv:2303.02095, 2023.
- [9]. Gupta Animesh, Parmar Jay, Dave Ishan Rajendrakumar, and Shah Mubarak. From play to replay: Composed video retrieval for temporally fine-grained videos. arXiv preprint arXiv:2506.05274, 2025.
- [10]. Hebert Jacqueline S and Lewicke Justin. Case report of modified box and blocks test with motion capture to measure prosthetic function. Journal of Rehabilitation Research & Development, 49(8), 2012.
- [11]. Hutchinson Matthew S and Gadepally Vijay N. Video action understanding. IEEE Access, 9:134611–134637, 2021.
- [12]. Kaku Aakash, Liu Kangning, Parnandi Avinash, Rajamohan Haresh Rengaraj, Venkataramanan Kannan, Venkatesan Anita, Wirtanen Audre, Pandit Natasha, Schambra Heidi, and Fernandez-Granda Carlos. Strokerehab: A benchmark dataset for subsecond action identification. Advances in Neural Information Processing Systems, 35:1671–1684, 2022.
- [13]. Khirodkar Rawal, Bagautdinov Timur, Martinez Julieta, Zhaoen Su, James Austin, Selednik Peter, Anderson Stuart, and Saito Shunsuke. Sapiens: Foundation for human vision models. arXiv preprint arXiv:2408.12569, 2024.
- [14]. Li Yanghao, Wu Chao-Yuan, Fan Haoqi, Mangalam Karttikeya, Xiong Bo, Malik Jitendra, and Feichtenhofer Christoph. MViTv2: Improved multiscale vision transformers for classification and detection. arXiv preprint arXiv:2112.01526, 2021.
- [15]. Liu Ze, Lin Yutong, Cao Yue, Hu Han, Wei Yixuan, Zhang Zheng, Lin Stephen, and Guo Baining. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [16]. Liu Ze, Ning Jia, Cao Yue, Wei Yixuan, Zhang Zheng, Lin Stephen, and Hu Han. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3202–3211, 2022.
- [17]. Liu Ziyu, Zhang Hongwen, Chen Zhenghao, Wang Zhiyong, and Ouyang Wanli. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 143–152, 2020.
- [18]. Lo Chung-Ming and Hung Peng-Hsiang. Predictive stroke risk model with vision transformer-based doppler features. Medical Physics, 51(1):126–138, 2024.
- [19]. Mahmood Ahmad, Vayani Ashmal, Naseer Muzammal, Khan Salman, and Khan Fahad Shahbaz. Vurf: A general-purpose reasoning and self-refinement framework for video understanding. arXiv preprint arXiv:2403.14743, 2024.
- [20]. Mainali Shraddha, Darsie Marin E, and Smetana Keaton S. Machine learning in action: stroke diagnosis and outcome prediction. Frontiers in Neurology, 12:734345, 2021.
- [21]. Narnaware Vishal, Vayani Ashmal, Gupta Rohit, Swetha Sirnam, and Shah Mubarak. Sb-bench: Stereotype bias benchmark for large multimodal models. arXiv preprint arXiv:2502.08779, 2025.
- [22]. O’Brien Megan K, Shin Sung Y, Khazanchi Rushmin, Fanton Michael, Lieber Richard L, Ghaffari Roozbeh, Rogers John A, and Jayaraman Arun. Wearable sensors improve prediction of post-stroke walking function following inpatient rehabilitation. IEEE Journal of Translational Engineering in Health and Medicine, 10:1–11, 2022.
- [23]. Parnandi Avinash, Kaku Aakash, Venkatesan Anita, Pandit Natasha, Fokas Emily, Yu Boyang, Kim Grace, Nilsen Dawn, Fernandez-Granda Carlos, and Schambra Heidi. Data-driven quantitation of movement abnormality after stroke. Bioengineering, 10(6):648, 2023.
- [24]. Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
- [25]. Potter Kirsten, Fulk George D, Salem Yasser, and Sullivan Jane. Outcome measures in neurological physical therapy practice: part I. Making sound decisions. Journal of Neurologic Physical Therapy, 35(2):57–64, 2011.
- [26]. Qureshi Rizwan, Sapkota Ranjan, Shah Abbas, Muneer Amgad, Zafar Anas, Vayani Ashmal, Shoman Maged, Eldaly Abdelrahman, Zhang Kai, Sadak Ferhat, et al. Thinking beyond tokens: From brain-inspired intelligence to cognitive foundations for artificial general intelligence and its societal impact. arXiv preprint arXiv:2507.00951, 2025.
- [27]. Raza Shaina, Qureshi Rizwan, Zahid Anam, Fioresi Joseph, Sadak Ferhat, Saeed Muhammad, Sapkota Ranjan, Jain Aditya, Zafar Anas, Hassan Muneeb Ul, et al. Who is responsible? the data, models, users or regulations? responsible generative ai for a sustainable future. arXiv preprint arXiv:2502.08650, 2025.
- [28]. Saeed Muhammed, Raza Shaina, Vayani Ashmal, Abdul-Mageed Muhammad, Emami Ali, and Shehata Shady. Beyond content: How grammatical gender shapes visual representation in text-to-image models. arXiv preprint arXiv:2508.03199, 2025.
- [29]. Stinear Cathy M, Smith Marie-Claire, and Byblow Winston D. Prediction tools for stroke rehabilitation. Stroke, 50(11):3314–3322, 2019.
- [30]. Sullivan Jane E, Andrews A Williams, Lanzino Desiree, Peron Aimee, and Potter Kirsten. Outcome measures in neurological physical therapy practice: part II. A patient-centered process. Journal of Neurologic Physical Therapy, 35(2):65–74, 2011.
- [31]. Thawakar Omkar, Vayani Ashmal, Khan Salman, Cholakal Hisham, Anwer Rao M, Felsberg Michael, Baldwin Tim, Xing Eric P, and Khan Fahad Shahbaz. Mobillama: Towards accurate and lightweight fully transparent gpt. arXiv preprint arXiv:2402.16840, 2024.
- [32]. Touvron Hugo, Cord Matthieu, Douze Matthijs, Massa Francisco, Sablayrolles Alexandre, and Jegou Herve. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pages 10347–10357, July 2021.
- [33]. Tran Du, Wang Heng, Torresani Lorenzo, Ray Jamie, LeCun Yann, and Paluri Manohar. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
- [34]. Benjamin Emelia J, Blaha Michael J, Chiuve Stephanie E, et al. Heart disease and stroke statistics—2017 update: a report from the American Heart Association. Circulation, 135:e146–e603, 2017.
- [35]. Van der Lee Johanna H, Wagenaar Robert C, Lankhorst Gustaaf J, Vogelaar Tanneke W, Devillé Walter L, and Bouter Lex M. Forced use of the upper extremity in chronic stroke patients: results from a single-blind randomized clinical trial. Stroke, 30(11):2369–2375, 1999.
- [36]. Vayani Ashmal, Dissanayake Dinura, Watawana Hasindri, Ahsan Noor, Sasikumar Nevasini, Thawakar Omkar, Ademtew Henok Biadglign, Hmaiti Yahya, Kumar Amandeep, Kuckreja Kartik, et al. All languages matter: Evaluating lmms on culturally diverse 100 languages. arXiv preprint arXiv:2411.16508, 2024.
- [37]. Vayani Ashmal, Dissanayake Dinura, Watawana Hasindri, Ahsan Noor, Sasikumar Nevasini, Thawakar Omkar, Ademtew Henok Biadglign, Hmaiti Yahya, Kumar Amandeep, Kuckreja Kartik, et al. All languages matter: Evaluating lmms on culturally diverse 100 languages. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19565–19575, 2025.
- [38]. Wade Eric, Parnandi Avinash Rao, and Mataric Maja J. Automated administration of the wolf motor function test for post-stroke assessment. In 2010 4th International Conference on Pervasive Computing Technologies for Healthcare, pages 1–7. IEEE, 2010.
- [39]. Wade Eric and Winstein Carolee J. Virtual reality and robotics for stroke rehabilitation: where do we go from here? Topics in Stroke Rehabilitation, 18(6):685–700, 2011.
- [40]. Zhu Wentao, Ma Xiaoxuan, Liu Zhaoyang, Liu Libin, Wu Wayne, and Wang Yizhou. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
