Author manuscript; available in PMC: 2022 Nov 29.
Published in final edited form as: IEEE J Biomed Health Inform. 2022 Aug 11;26(8):4020–4031. doi: 10.1109/JBHI.2022.3167927

Towards a Comprehensive Solution for a Vision-Based Digitized Neurological Examination

Trung-Hieu Hoang 1, Mona Zehni 2, Huaijin Xu 3, George Heintz 4, Christopher Zallek 5, Minh N Do 6
PMCID: PMC9707344  NIHMSID: NIHMS1829649  PMID: 35439148

Abstract

The ability to use digitally recorded and quantified neurological exam information is important to help healthcare systems deliver better care, in person and via telehealth, as they compensate for a growing shortage of neurologists. Current neurological digital biomarker pipelines, however, are narrowed to a specific neurological exam component or limited to assessing specific conditions. In this paper, we propose an accessible vision-based exam and documentation solution called Digitized Neurological Examination (DNE) to expand exam biomarker recording options and clinical applications using a smartphone/tablet. Through our DNE software, healthcare providers in clinical settings and people at home can video-record an examination while performing instructed neurological tests, including finger tapping, finger to finger, forearm roll, and stand-up and walk. The modular design of the DNE software supports the integration of additional tests. From the recorded examinations, the DNE extracts the 2D/3D human-body pose and quantifies kinematic and spatio-temporal features. The features are clinically relevant and allow clinicians to document and observe the quantified movements and the changes of these metrics over time. A web server and a user interface for viewing recordings and visualizing features are available. DNE was evaluated on a collected dataset of 21 subjects containing normal and simulated-impaired movements. The overall accuracy of DNE is demonstrated by classifying the recorded movements using various machine learning models. Our tests show an accuracy beyond 90% for the upper-limb tests and beyond 80% for the stand-up and walk tests.

Keywords: Digital biomarkers, digitized exams, teleneurology, quantitative analysis, disease documentation, monitoring, finger tapping, finger to finger, forearm roll, stand-up and walk, gait, human pose, machine learning

I. Introduction

THE burden and prevalence of neurological disorders [1] and the national shortage of neurologists [2] continue to grow hand in hand. This increases disparity through unequal access to clinical care and drives worsening clinician burnout rates. Meanwhile, the COVID-19 pandemic has accelerated the transition from in-person to virtual neurological examinations [3], [4] through teleneurology (TN) platforms. Rapidly developing TN has shown potential in making efficient assessments remotely [5]–[7], distributing scarce healthcare resources, and enhancing accessibility to neurological care [8], [9]. In addition, digital biomarker exam solutions with quantification of physical evaluations that bypass clinician availability and subjectivity of assessments [10] are important to improve care and compensate for the shortage of neurologists.

Current digital biomarker exam systems are devoted to a single neurological test [11]–[13], require advanced setups/equipment [14], or lack automated assessments [15], [16]. Therefore, a digital biomarker solution, 1) suitable for use by neurologists and non-neurologists, 2) with wide applicability at clinics or home, 3) that is easy to deploy, 4) supports a wide range of neurological tests, and 5) enables automated objective quantitative evaluations, would significantly advance health care delivery.

For this purpose, in this work, we introduce an end-to-end vision-based exam and documentation platform named Digitized Neurological Examination (DNE). As part of DNE, we designed an easy-to-use smartphone/tablet software with predefined examination instructions. The DNE software allows users to video-record their performance on several neurological screening examinations, including finger tapping (FT), finger to finger (FTF), forearm roll (FR), and stand-up and walk (SAW). These recordings are uploaded to secure cloud-based storage. In an offline step, for each recording, the 2D/3D pose, estimating the location of major human-body keypoints, is extracted using deep-learning-based solutions such as OpenPose [17] and VideoPose3D [18]. From the estimated pose, unified digital biomarkers, including spatio-temporal and kinematic features, are computed [19]. We showcase the performance of our system on a dataset collected from 21 healthy subjects taking different neurological tests (FT, FTF, FR, SAW) when their function is normal or with a simulated impairment. We incorporate our defined features in a variety of machine learning models to detect abnormal functioning in our dataset. Fig. 1 illustrates the capabilities of our DNE system.

Fig. 1. Illustration of our digitized neurological exam system.

We summarize the key contributions of this work as:

  • We develop a unified and modular software package for high-quality DNE recording collection. Our DNE software is easy-to-use, allows the integration of new tests, and runs on handheld iOS devices. We also implement a web-based dashboard for viewing the recordings and feature visualization.

  • We propose a vision-based approach to study various neurological tests (FT, FTF, FR, and SAW). For each test, we define clinically interpretable kinematic and spatiotemporal quantified features.

  • To the best of our knowledge, we are the first to construct a vision-based dataset consisting of multiple neurological tests and simulated-impaired video recordings per subject alongside the extracted 2D/3D pose. Analyzing this dataset allows us to have a normal self-baseline for each abnormal recording and test the power of the extracted features in distinguishing normal from abnormal performance. Our dataset (excluding RGB videos due to privacy restrictions) and code will be available at https://dneproject.web.illinois.edu/.

The organization of this paper is as follows. Section II summarizes recent studies on digital biomarker systems. Section III describes DNE's software platform used in our data collection. Section IV introduces our DNE dataset. We define our features in detail in Section V. Section VI contains our analysis results, Section VII discusses the clinical relevance and remaining challenges, and Section VIII draws our main conclusions.

II. Related Work

In this section, we review the literature related to the different tests (FT, FTF, FR, SAW). For each test, we briefly discuss the existing sensor-based, web/smartphone-based, and vision-based solutions.

Finger Tapping (FT):

Sensor-based FT assessments study the spectral analysis of gyroscope data [20], the opening finger-tap velocity captured by accelerometers [21], and the standard deviation, range, and entropy measured by a collection of sensors including synchronized wrist watches, pressure sensors, and accelerometers [22]. Several smartphone-based applications [23]–[26] are designed to quantitatively evaluate various symptoms and motor skills in patients with Parkinson's Disease (PD). While these approaches are proven effective and low cost, their measurements are not as informative as those of vision-based methods, which rely on video data and simulate in-person clinical examinations. Among vision-based pipelines, [11], [27]–[29] extract a set of interpretable kinematic features from the tracked positions of the fingers given an RGB video. These features are easy to explain and associate with clinical symptoms. On the other hand, black-box deep learning models operating on the estimated finger poses and their derivatives are proposed in [30]. While these solutions provide high accuracy, unlike our DNE, they lack explainability and require large training sets to generalize and avoid overfitting.

Finger to Finger (FTF):

A well-studied test in the literature that is similar to FTF in terms of measuring smoothness and upper-extremity coordination is the finger-to-nose test. Among sensor-based methods, Rodrigues et al. [31] investigate the coordination ability of patients with chronic stroke versus healthy controls using a complex marker-based motion analysis system. Oubre et al. [32] studied ataxia through wearable inertial sensors and a computer-tablet version of the finger-to-nose test. Furthermore, predicting severity levels of ataxia or PD via a rapid web-based computer mouse test is explored in [33]. Jaroensri et al. [12] are among the first to propose vision-based solutions that are on par with a specialist in rating the severity scale of PD while using estimated joint positions from recorded videos.

Upper Limb Tests:

To the best of our knowledge, sensor-based or vision-based studies related to the forearm roll task are scarce. Thus, here we further overview the existing methods devoted to the study of upper-limb movements. Using wearable sensors, Cruz et al. [14] assessed the acceleration, velocity, and smoothness of the upper-limb motor function of patients after stroke. A low-cost Kinect-based solution, tracking subjects' hands as they move a marker along a rectangular pattern, is proposed in [34]. The range of motion is analyzed using an internet-based goniometer in [35]. In [36], the authors describe a vision-based system that captures upper-limb motions via multiple cameras installed at different viewpoints. While this multi-camera system is less sensitive to occlusions and dynamic backgrounds, unlike our DNE system, it requires a special setup that is hard to install for home use.

Stand-up and Walk (SAW):

In our review of the gait analysis literature, we focus on marker-less [37] vision-based solutions, mainly measured using general handheld cameras and mobile devices. In early efforts for marker-less gait analysis, silhouettes were extensively used to detect heel-strike and toe-off occurrences. These two events refer to the first and last ground contact of each foot and were later adopted to accurately estimate important gait parameters [38]–[41]. However, these methods are restricted to specific laboratory settings and are sensitive to the quality of foreground/background segmentation. The surge of research in the human pose estimation field [42]–[44] brought along popular deep learning frameworks which accurately estimate the 2D/3D location of body joints from different inputs including RGB images, videos, and depth maps [17], [18], [45], [46]. Gait assessment solutions relying on the pose estimated from either depth or RGBD data [47], [48] have studied the rotational angle and angular velocity of certain body keypoints [49] and evaluated spatio-temporal gait metrics such as step length and step time [13], [50].

Wei et al. [16] introduced an automated smartphone-based video capturing system with hand/body pose estimation. While neurological exams such as gait are considered in [16], feature extraction and analysis are not studied and the main focus is on the quality control of the video acquisition process. Using the estimated pose from OpenPose [17], Xue et al. [13] studied the remote monitoring of gait parameters for senior care. Furthermore, [51] reports timings of different segments of the timed-up-and-go (TUG) test by performing frame-based activity classification on 2D pose data. To assess the freezing of gait (FoG) symptom in Parkinson's patients, [52] proposed the use of frequency analysis methods while [53] adopted graph convolutional neural networks to obtain the probability of FoG from pose data. Kidziński et al. [54] employed black-box deep learning models to estimate the level of movement disorder in children suffering from cerebral palsy. Despite their promising results, deep learning based solutions are less interpretable and require large supervised training datasets for better generalization.

III. System Design

As part of DNE, we developed three software packages that handle data acquisition, analysis, and results reporting.

DNE Recorder:

This module supports easy self- or assisted video recording of a set of pre-defined neurological tests. DNE Recorder is an iOS mobile application. It includes detailed instructions on how to perform each test alongside automated video capturing functions. Our software facilitates recording of high-quality depth maps on devices equipped with LiDAR. We collect high-quality 1080 × 720 RGB videos, depth videos (on applicable hardware), and camera calibration parameters at 60 frames per second (FPS). All recordings are synchronized to secure cloud storage for offline processing. The user interface of this module is shown in Fig. 2(a).

Fig. 2. DNE System. (a) DNE Recorder, an iOS application for collecting neurological recordings. (b) DNE Viewer, a web application for dataset management, video previewing, and visualizing the analysis results (best viewed in magnification).

DNE Analyzer:

We analyze the RGB recordings offline in a separate module. The main components of DNE Analyzer include 1) vision-based pose estimation, 2) feature extraction, and 3) abnormality detection.

DNE Viewer:

We provide a secure web application for clinicians, neurologists and researchers to monitor raw recordings and view the analysis results from all subjects remotely. Fig. 2 (b) displays a screenshot of the DNE Viewer user interface.

IV. Dataset Collection

Our dataset collection protocol was IRB approved (#IRB.1452500) on 02/27/2020 by the University of Illinois College of Medicine at Peoria Institutional Review Board. In this study, 21 healthy volunteers (18 females/3 males) were recruited by convenience sampling at the OSF HealthCare Illinois Neurological Institute Outpatient Neurology Clinic (Peoria, IL). Neurological examinations assess fine motor and mobility abilities. We study FT, FTF, and FR for fine motor tasks, and evaluate mobility with the SAW test. Below we describe in detail how these tasks are performed.

  • FT: Participants are instructed to place their hands within the camera view with their index fingers and thumbs touching. They then tap the index finger and thumb together, opening as wide and tapping as fast as they can, for 15 seconds.

  • FR: Participants are asked to gently clench their hands, hold their forearms horizontally, and roll their hands around each other as fast as possible for 15 seconds.

  • FTF: Participants repetitively first point their index fingers towards the ceiling and then touch their fingers together in front of their chests for a duration of 15 seconds.

  • SAW: Participants stand up from a seated position in a chair, move the chair out of the way, and walk back and forth over a 15-foot distance. The designated time for the SAW test is 45 seconds.

Each subject took two sets of neurological examinations supervised by a neurologist. In the first set of examinations, the subjects performed the tasks normally. However, for the second set, the subjects were asked to simulate motor dysfunction, i.e. perform the test abnormally. For this purpose, the subjects wore devices to deliberately add disruption to their performance and mimic impairments. For FT, a rubber band is used to restrict movements of the index and thumb fingers. For the FR and SAW tests the subjects put on a left wrist and a knee brace, respectively. On the other hand, for the FTF test, the subjects were asked to deliberately mimic a tremor pattern in moving their fingers and hands. Snapshots of recordings and subjects wearing the devices are exhibited in Fig. 3.

Fig. 3. Examples of DNE dataset recordings. Impairments are induced by wearing a wrist brace for FR, a rubber band for FT, and a knee brace for SAW tests.

Both sets of recordings are acquired by our DNE Recorder on iPad Pro 11 and iPhone 11 devices. For upper-body tests, we have a close-up frontal view of the subjects with the pelvis visible. Moreover, to assess the invariance of our analysis under small deviations from the frontal camera view, the view of the recordings taken on the iPhone is slightly to the left compared to the iPad recordings. In addition, for SAW, we record both sagittal and frontal views, using the iPad and iPhone, respectively. In total, including all four tests (FR, FT, FTF, SAW), we collect 375 videos. Table I provides a summary of our dataset.

TABLE I.

Summary of Our DNE Dataset

Test | Total | Label: Normal | Label: Abnormal | View: Front | View: Side | Video: RGB/D | Video: RGB

FT | 95 | 41 | 54 | 95 | - | 45 | 50
FR | 92 | 47 | 45 | 92 | - | 40 | 52
FTF | 85 | 41 | 44 | 85 | - | 45 | 40
SAW | 103 | 41 | 62 | 61 | 42 | 54 | 49

While there is hardly any similar publicly available upper-body neurological dataset, several datasets specifically study gait impairments [13], [38], [39], [52], [54]. The closest to our dataset is KIMORE [55], which focuses on rehabilitation exercises rather than neurological tests. KIMORE provides RGB, depth, and pose data for each recording, collected by a Kinect v2, which is not as ubiquitous as the handheld devices adopted in DNE. In Table II, we compare our dataset against state-of-the-art public gait impairment datasets in various aspects. For this comparison, we only focus on studies using a single-view, portable camera for data collection, similar to our setting. Accordingly, we list the contributions introduced by our dataset as: 1) This is the first public dataset studying multiple neurological test segments. 2) Our dataset includes normal and abnormal performance of the same task for each particular subject. 3) Our dataset contains multiple data modalities, including depth videos, camera parameters, and 2D/3D pose estimates.

TABLE II.

Comparison Between Multiple Vision-Based Gait Impairment Video Datasets, Acquired by a Single Camera

Dataset | Availability | Sagittal View | Frontal View | Data Type | Mobile Device | Number of Subjects | Number of Sequences | Pose Estimation | Normal and Abnormal Pairs

Xue et al. [13] | | | | RGB | | - | - | 2D |
Sato et al. [52] | | | | RGB | | 2 | 2 | 2D |
Ortells et al. [38] | | | | Binary | | 10 | 20 | |
Nieto-Hidalgo et al. [39] | | | | Binary | | - | 73 | |
Kidzinski et al. [54] | | | | RGB | | 1026 | 1792 | 2D |
Ours | ✓ | ✓ | ✓ | RGB/D | ✓ | 21 | 336 | 2D/3D | ✓

V. DNE Vision-Based Analysis

In our DNE analysis pipeline, given an RGB video, we first compute the human pose in each frame. Next, from the pose time series, we extract a set of features that quantify the subject's performance in various aspects. We structure our analysis pipeline into three layers, namely 1) pose estimation, 2) feature extraction, and 3) application layer, as illustrated in Fig. 4. The pose estimation layer provides frame-level, high-quality 2D/3D joint locations (Section V-A). We pre-process the estimated pose to prepare it for feature computation. In the feature extraction layer, we calculate a set of features that describe the subject's performance on various tests. We carefully design these features for each test separately to accurately reflect the subjects' performance and the abnormalities specific to each test. Lastly, the application layer contains several downstream tasks consuming the features, including abnormality detection and visualization for a qualitative comparison among recordings.

Fig. 4. Overview of the DNE vision-based analysis framework.

A. Pose Estimation

For upper-body tests (FT, FTF, and FR), we use OpenPose (OP) to estimate the 2D hand [56] and body [17] pose. On the other hand, for SAW tests, we compute the 3D pose using the VideoPose3D (VP3D) package [18]. Given an RGB image, OP first detects all visible body parts and associates them with each individual by solving a graph matching problem. Meanwhile, VP3D adopts dilated temporal convolutions to estimate the 3D pose from a sequence of 2D keypoints extracted from the video.

For upper-body tests, if the subject and the moving limb are located parallel to the camera plane, the motion is well approximated in a plane, i.e. in two dimensions. That is why the 2D pose is chosen for upper-body tests. However, this might not hold for the SAW test (especially depending on the camera view), hence urging us to use the 3D pose for this analysis.
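As a concrete starting point for the analysis in the following subsections, the sketch below shows one way the per-frame 2D keypoints might be loaded, assuming OP was run with its hand detector and JSON output option (one JSON file per frame, with flat (x, y, confidence) arrays per detected person). The directory name, the single-subject assumption, and the NaN handling of missed detections are illustrative choices, not part of the DNE pipeline itself.

```python
import json
from pathlib import Path

import numpy as np


def load_openpose_hand_keypoints(json_dir, hand="right"):
    """Load per-frame 2D hand keypoints from OpenPose JSON output.

    Assumes one subject per frame. Returns an array of shape
    (num_frames, 21, 3) holding (x, y, confidence) per hand keypoint.
    """
    key = f"hand_{hand}_keypoints_2d"
    frames = []
    for json_file in sorted(Path(json_dir).glob("*.json")):
        with open(json_file) as f:
            data = json.load(f)
        if not data["people"]:                    # no detection in this frame
            frames.append(np.full((21, 3), np.nan))
            continue
        flat = np.asarray(data["people"][0][key], dtype=float)
        frames.append(flat.reshape(-1, 3))
    return np.stack(frames)


# Example: right-hand keypoints across a finger-tapping recording.
# right_hand = load_openpose_hand_keypoints("ft_recording_json/", hand="right")
```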

B. Pre-processing

We truncate a recording to only include the sequence of frames in which the subject performs the test. To account for the variable distance of the subjects from the camera, we normalize the estimated pose by a reference length. For the FT, FTF, and FR tests, the reference is the length of the forearm. For SAW, the reference is the distance between the pelvis and neck joints. We compute the reference lengths as the median of the value across all frames. In addition, as the estimated pose can be erroneous at some frames, we apply median and Savitzky-Golay filtering [58]. In our dataset, we have excluded 27 recordings due to unreliable and noisy estimated pose. Therefore, we only analyzed 348 videos in total.
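A minimal sketch of this pre-processing step is given below, assuming the keypoint tracks are already available as NumPy arrays; the filter window lengths and polynomial order are illustrative values rather than the exact settings used in DNE.

```python
import numpy as np
from scipy.signal import medfilt, savgol_filter


def preprocess_keypoint_track(xy, ref_a, ref_b, win_median=5, win_sg=11, poly=3):
    """Normalize and smooth a single keypoint trajectory.

    xy:           (N, 2) pixel coordinates of one keypoint over N frames.
    ref_a, ref_b: (N, 2) coordinates of the two joints defining the
                  reference segment (e.g. elbow/wrist for the forearm,
                  or pelvis/neck for the SAW test).
    """
    # Reference length: median segment length across all frames.
    ref_len = np.median(np.linalg.norm(ref_a - ref_b, axis=1))
    xy = xy / ref_len

    # Median filtering suppresses isolated pose-estimation spikes, and
    # Savitzky-Golay filtering smooths while preserving the periodic shape.
    smoothed = np.empty_like(xy)
    for c in range(xy.shape[1]):
        col = medfilt(xy[:, c], kernel_size=win_median)
        smoothed[:, c] = savgol_filter(col, window_length=win_sg, polyorder=poly)
    return smoothed
```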

C. Notations

Given the pose sequence estimated from the RGB video, we extract a set of quantified features. Below, we first express our notations and then introduce the features we defined for each test. Let v = [v_1, ..., v_N] denote the set of N frames ordered chronologically in video v. There is a one-to-one correspondence between the time associated with each frame and the frame index, where t = [t_1, ..., t_N] and t_i = i/fps, with fps denoting the frame rate of the video. Given v and the pose estimation module (such as OP or VP3D), we extract the location of K keypoints in each frame. For convenience, we use the same indexing of the body joints for both the 2D and 3D pose. However, to differentiate between the 2D and 3D pose, we denote them by B_2 and B_3, respectively. Furthermore, we use H_2 to represent the 2D hand keypoints. An illustration of the hand and body skeleton trees alongside our indexing notations is provided in Fig. 5. Note that, for the sake of brevity, we have only indexed the subset of keypoints that we use in our analysis.

Fig. 5. Skeleton tree for (a) body B, and (b) hand H. Examples of human pose estimation (c) in 2D (B_2, H_2) using OpenPose [17] and (d) in 3D (B_3) using VideoPose3D [18].

We reserve s_{k,*}[i] for the location of the k-th keypoint at frame i, corresponding to skeleton tree * ∈ {H_2, B_2, B_3}. For * ∈ {H_2, B_2}, s_{k,*}[i] ∈ ℝ^2, and for * = B_3, s_{k,*}[i] ∈ ℝ^3. Furthermore, we add superscripts r and l to point to the right and left (R/L) body parts, respectively. For example, s_{3,H_2}^r[i] locates the tip of the right thumb at frame i.

To extract kinematic features that quantify the performance of a subject in a test, we track the location of various major keypoints and define a set of features accordingly. Major keypoints vary based on the test. For instance, the major keypoints in FT include the tip of the index and thumb fingers of two hands while in FR, we closely track the wrist joints.

In different tests, the subjects are asked to move certain limbs repeatedly. Thus, it is natural to compute features such as frequency and amplitude for the periodic pose patterns and report the mean and standard deviation (STD) across different cycles. In addition, for a test performed normally, the features corresponding to the R/L body parts should be close. Thus, to quantify the difference between the right f_r and left f_l features, we define an asymmetry metric as:

Asym(f_r, f_l) = |f_r − f_l| / (f_r + f_l).  (1)

Another useful metric in our analysis is the Pearson correlation coefficient, denoted by CC. For two 1D discrete time series x_1 and x_2, we define CC as:

CC(x_1, x_2) = (x_1 − x̄_1)^T (x_2 − x̄_2) / (‖x_1 − x̄_1‖_2 ‖x_2 − x̄_2‖_2),  (2)

where the bar and T denote the mean and transpose operators. For highly correlated series, |CC| is close to one.
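For reference, (1) and (2) translate directly into a few lines of NumPy; the helper names below are ours, not part of the DNE code base.

```python
import numpy as np


def asym(f_r, f_l):
    """Asymmetry between a right and a left feature value, Eq. (1)."""
    return abs(f_r - f_l) / (f_r + f_l)


def cc(x1, x2):
    """Pearson correlation coefficient between two 1D series, Eq. (2)."""
    x1 = np.asarray(x1, dtype=float) - np.mean(x1)
    x2 = np.asarray(x2, dtype=float) - np.mean(x2)
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
```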

D. Feature Definition

We list the features defined for various tests in Table III and describe them in detail below.

TABLE III.

Summary of Our DNE Features

Finger Tapping (FT):
  • Amplitude (R/L) [Mean, STD, Median, Asymmetry]: Maximum distance between the tip of the index and thumb fingers.
  • Period (R/L) [Mean, STD, Median, Asymmetry]: Time (in seconds) taken to complete one tapping cycle for R/L hands.
  • Frequency (R/L) [Mean, STD, Median]: Reciprocal of the period (1/second) for R/L hands.
  • Maximum speed (R/L) [Mean, Asymmetry]: Maximum of the instant tapping speed (defined as the derivative of the distance between the tip of the index and thumb fingers with respect to time) for R/L hands.
  • Maximum acceleration (R/L) [Mean, STD, Median, Asymmetry]: Maximum of the instant tapping acceleration (defined as the second derivative of the distance between the tip of the index and thumb fingers with respect to time) for R/L hands.
  • Average tapping rate (R/L): Total number of finger taps divided by the duration of the FT test in seconds for R/L hands.
  • Wrist stability [Mean, STD, Median]: Variations in the R/L wrist joint positions.

Finger to Finger (FTF):
  • Horizontal symmetry [CC]: The CC of the horizontal spatial trajectory of the R/L index finger.
  • Vertical symmetry [CC]: The CC of the vertical spatial trajectory of the R/L index finger.
  • Period (R/L) [Mean, STD]: Total time (in seconds) taken for one complete cycle (moving from the highest to the lowest vertical position and back) on each side.
  • Average speed (R/L) [Mean, STD]: The traversed distance of the R/L index fingers within half a cycle's period divided by half the cycle's period.
  • Path smoothness (R/L) [Mean, STD]: The ratio between the actual traversed distance of the R/L index fingers and the length of the fitted smooth curve.
  • Velocity angle symmetry (R/L) [Mean, STD]: The pairwise CC between the velocity angle series of any two cycles.

Forearm Roll (FR):
  • Amplitude (R/L) [Mean, STD, Median, Asymmetry]: Distance between the minimum and maximum of the vertical position of the R/L wrists.
  • Period (R/L) [Mean, STD, Median, Asymmetry]: Time (in seconds) taken to complete one forearm roll cycle for R/L hands.
  • Maximum speed (R/L) [Mean, STD, Median, Asymmetry]: Maximum of the forearm roll speed (defined as the first derivative of the vertical coordinate of the wrist joint with respect to time) for R/L hands.
  • Maximum acceleration (R/L) [Mean, STD, Median, Asymmetry]: Maximum of the forearm roll acceleration (defined as the second derivative of the vertical coordinate of the wrist joint with respect to time) for R/L hands.
  • Rolling speed (R/L) [Mean, STD, Median]: Average forearm roll speed (defined as the amplitude divided by half the rolling cycle period).
  • Average rolling rate (R/L): Total number of forearm roll cycles divided by the duration of the FR test in seconds for R/L hands.

Stand-up and Walk (SAW):
  • Knee angle symmetry [Mean, STD, Median]: The CC of the aligned R/L knee angle series within a walking segment (a full pass of the room length).
  • Step symmetry [Mean, STD, Median]: The cycle-wise CC of the aligned spatial trajectory of the R/L foot in the horizontal axis.
  • Step length [Mean, STD, Median]: The furthest distance between the two feet within each step.
  • Step width [Mean, STD, Median]: The shortest distance between the two feet within each step.
  • Step time [Mean, STD, Median]: The time (in seconds) to complete one step (the interval between two consecutive time-points having the shortest distance between the two feet).
  • Time to stand: Total time taken (in seconds) from the first stand-up effort to a full standing-on-feet state.
  • Turning time [Mean, STD, Median]: Total time taken (in seconds) for a subject to turn around after each walking segment.
  • Walking speed [Mean, STD]: Total traveled distance of the pelvis joint divided by the duration of a walking segment.
  • Cadence [Mean, STD]: Total number of steps divided by the duration of a walking segment.

Asymmetry Between R/L Features is Computed Based on (1).

Finger Tapping (FT):

For this test, the major keypoints are the tip of the R/L thumb and index fingers alongside R/L wrist and elbow joints. To extract properties of the periodic motion, we examine the distance between the tip of the index and thumb fingers across time defined as:

d_ft^*[i] = ‖ s_{3,H_2}^*[i] − s_{6,H_2}^*[i] ‖_2,  * ∈ {r, l}.  (3)

Examples of d_ft^r and d_ft^l for normal and abnormal executions of the FT test are provided in Fig. 6. In our dataset, to simulate abnormality in FT, the subjects wear a rubber band around the index and thumb fingers of one hand. As also revealed in Fig. 6, this limits the tapping amplitude of the hand wearing the band and slows down the tapping rate. Given d_ft^*, we compute the period for the * hand, T_ft^*, as the time (in seconds) between two consecutive local minima (or maxima) of d_ft^*. The frequency F_ft^* is the reciprocal of T_ft^*. We also report the tapping amplitude A_ft^* as the difference between consecutive minima and maxima of d_ft^*. We also report the asymmetry of the periods (Asym(T_ft^r, T_ft^l)), frequencies (Asym(F_ft^r, F_ft^l)), and amplitudes (Asym(A_ft^r, A_ft^l)) of the R/L hands following (1).

Fig. 6. FT amplitude for normal and abnormal examples.

Furthermore, we define the instant tapping speed and acceleration for the R/L hands as the first and second order derivatives of d_ft^r and d_ft^l with respect to time. We adopt the mean and maximum of the instant speed and acceleration across tapping cycles as features. We also introduce the average tapping rate as the average number of finger taps per second.

Finally, to evaluate the stability of the hands and arms during the FT recording, we examine the wrist and elbow joints. For this purpose, we introduce the relative height between (s_{7,B_2}^r, s_{7,B_2}^l) and (s_{6,B_2}^r, s_{6,B_2}^l) across the N frames:

C_ft^wrist = (1/N) Σ_{i=1}^{N} ‖ s_{7,B_2}^r[i] − s_{7,B_2}^l[i] ‖_2 / ‖ s_{7,B_2}^r[i] ‖_2,  (4)
C_ft^elbow = (1/N) Σ_{i=1}^{N} ‖ s_{6,B_2}^r[i] − s_{6,B_2}^l[i] ‖_2 / ‖ s_{6,B_2}^r[i] ‖_2.  (5)
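To make the FT feature definitions concrete, the sketch below computes the distance signal of (3) and period, frequency, amplitude, speed, and tapping-rate statistics for one hand; the peak-picking settings are left at their defaults, and the amplitude is approximated from the mean peak and valley values, which is a simplification of the per-cycle definition above.

```python
import numpy as np
from scipy.signal import find_peaks


def finger_tap_features(index_tip, thumb_tip, fps):
    """FT features for one hand from (N, 2) normalized keypoint tracks."""
    d = np.linalg.norm(index_tip - thumb_tip, axis=1)   # d_ft, Eq. (3)

    maxima, _ = find_peaks(d)       # fingers fully open
    minima, _ = find_peaks(-d)      # fingers closed (tap events)

    periods = np.diff(maxima) / fps                     # seconds per tap cycle
    amplitude = d[maxima].mean() - d[minima].mean()     # mean tap opening
    speed = np.gradient(d) * fps                        # instant tapping speed

    return {
        "period_mean": periods.mean(),
        "period_std": periods.std(),
        "frequency_mean": (1.0 / periods).mean(),
        "amplitude_mean": amplitude,
        "max_speed": np.abs(speed).max(),
        "tap_rate": len(minima) / (len(d) / fps),       # taps per second
    }


# Right/left asymmetry of any of these features then follows Eq. (1), e.g.
# abs(f_r - f_l) / (f_r + f_l) applied to the frequency means of both hands.
```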

Finger to Finger (FTF):

In our dataset, we observe that the pose estimated by OP for the middle joint of the index finger, i.e. joint index 5 in H_2, is more stable than the outer fingertip. Hence, we focus on this joint for the FTF test. In a normal FTF, the horizontal and vertical trajectories of the R/L hands are symmetric up to a mirroring (Fig. 7(a) top row), while this does not necessarily hold for the abnormal case (Fig. 7(a) bottom row). Thus, in each cycle, we define the cross correlation of the R/L horizontal (x) and vertical (y) coordinates as the horizontal symmetry S_ftf^finger-x and vertical symmetry S_ftf^finger-y:

S_ftf^finger-x = CC([s_{5,H_2}^l]_x, [s_{5,H_2}^r]_x),  (6)
S_ftf^finger-y = CC([s_{5,H_2}^l]_y, [s_{5,H_2}^r]_y),  (7)

where [s_{5,H_2}^*]_c = {[s_{5,H_2}^*[i]]_c}_{i=1}^{N}, c ∈ {x, y} and * ∈ {r, l}, denotes the x or y coordinate series of the pose. We also compute the period and average speed. We derive the average speed by dividing the traversed distance of the R/L finger within half a cycle's period by half the cycle's period.

Fig. 7. FTF features including finger (a) positions, (b) spatial trajectory, (c) velocity angle. Green (red) curves stand for normal (abnormal) recordings. In each row (column), the subplots share the same vertical (horizontal) axis.

Patients with neurological impairments tend to have tremors while moving their fingers during the FTF test [59]. This leads to a deviation of the fingers' trajectory from a smooth curve. To characterize this deviation, we first fit a smooth curve to the fingers' trajectory, in the form of a second-order polynomial in terms of the x and y coordinates. We observe that such a second-order fit matches the FTF trajectories of normal subjects well. We consider the length of this smooth curve as a reference to compare against the length of the original fingers' trajectory. We then define the ratio of the length of the actual fingers' trajectory during each FTF cycle to the length of the fitted smooth curve as the path smoothness (PS) metric. We report PS for the R/L hands. Examples of normal and abnormal finger trajectories alongside the fitted smooth curves are plotted in Fig. 7(b).
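A minimal sketch of this PS computation is shown below, under the assumption that the smooth reference is obtained by fitting y as a quadratic function of x over one cycle; the paper's exact fitting procedure may differ.

```python
import numpy as np


def path_smoothness(finger_xy):
    """Ratio of the actual trajectory length to the fitted smooth-curve length.

    finger_xy: (N, 2) trajectory of the index-finger joint over one FTF cycle.
    Values close to 1 indicate a smooth, tremor-free movement.
    """
    x, y = finger_xy[:, 0], finger_xy[:, 1]

    # Smooth reference: second-order polynomial y = p(x) fitted to the cycle.
    coeffs = np.polyfit(x, y, deg=2)
    y_fit = np.polyval(coeffs, x)

    def arc_length(xs, ys):
        return np.sum(np.hypot(np.diff(xs), np.diff(ys)))

    return arc_length(x, y) / arc_length(x, y_fit)
```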

Another feature we found helpful in detecting abnormal function in FTF is the instant velocity. We derive the instant velocity vector as the first derivative of the horizontal and vertical pose with respect to time. We then examine the angle between the vertical and horizontal components of this vector for the R/L hands. At time instant t, the velocity angle θ is:

θ^*(t) = atan2( d[s_{5,H_2}^*]_y/dt , d[s_{5,H_2}^*]_x/dt ),  * ∈ {r, l}.  (8)

Next, for each hand, we compare θ across different cycles using the CC in (2). Given N_C cycles, we obtain (N_C choose 2) CC values assessing the symmetry of the R/L velocity angles across different cycles, which we summarize by reporting the mean and STD. Examples of normal and abnormal aligned velocity angles across different cycles are provided in Fig. 7(c). Note that for abnormal FTF, large-magnitude fluctuations, caused by tremors in moving the hands, visibly appear in θ.
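The sketch below illustrates (8) and the pairwise cycle comparison; it assumes the cycles have already been segmented and resampled to a common length, which is a simplification of the alignment described above.

```python
import numpy as np


def velocity_angles(finger_xy, fps):
    """Velocity angle theta(t) of one finger track, Eq. (8)."""
    vx = np.gradient(finger_xy[:, 0]) * fps
    vy = np.gradient(finger_xy[:, 1]) * fps
    return np.arctan2(vy, vx)


def cycle_angle_symmetry(cycles):
    """Mean/STD of pairwise CCs between the velocity-angle series of cycles.

    cycles: list of equal-length 1D angle arrays, one per FTF cycle.
    """
    ccs = []
    for i in range(len(cycles)):
        for j in range(i + 1, len(cycles)):
            ccs.append(np.corrcoef(cycles[i], cycles[j])[0, 1])
    return np.mean(ccs), np.std(ccs)
```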

Forearm Rolling (FR):

We include the wrist and elbow joints as the major keypoints for this test. We specifically attend to the vertical coordinate of the wrist joints to compute the period T_fr^* and amplitude A_fr^* for * ∈ {r, l}. Fig. 8 illustrates the vertical position of the R/L wrists for a normal and an abnormal example. Note that, due to wearing the device in the abnormal recording, the period of the forearm roll cycles for both R/L hands is larger compared to its normal counterpart. In addition, similar to FT, we include the asymmetry of the aforementioned metrics in the FR features.

Fig. 8. Vertical (y) coordinate of the wrist joint versus time for normal and abnormal examples in the FR test.

We also include the maximum instant speed and acceleration derived from the vertical coordinates of the wrist joints. Similar to FT, we define the rolling speed and rate. The rolling speed is computed as the difference between the minimum and maximum of the y coordinate of the R/L hands divided by half the rolling period. Also, the rolling rate is defined as the number of rolling cycles per second. Finally, we report the stability of the elbows C_fr^elbow, defined analogously to (5).

Stand-up and Walk (SAW):

We use the side-view SAW recordings in our analysis of the SAW test. For SAW pose estimation, we use VP3D [18]. In VP3D, the joint locations are defined relative to the pelvis joint. As a result, the pose estimated by VP3D misses the global position of subjects within a frame, which is essential to detect different segments of the SAW test, i.e. stand-up (SU), walk (W), and turn (TU). This urged us to track the 2D position of the pelvis s_{0,B_2} extracted by OP as a notion of the subject's global position in a video frame. Analyzing this position through time enables us to split a SAW recording into multiple non-overlapping SU, W, and TU segments. Supplementary Fig. S3 visualizes these segments.

For the SU segment of SAW, we focus on the time to stand [60], measured as the total time from the first SU effort to a full standing-on-feet state. We derive the time to stand by thresholding the magnitude of the pelvis joint's velocity. Note that, since our subjects are asked to walk back and forth across a designated room multiple times, at some points they have to change direction and turn around. We report the time to turn around as another indicative feature for the SAW test.
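A minimal sketch of this thresholding step, assuming the 2D pelvis track and frame rate are available, is given below; the threshold value is illustrative and would need tuning on actual recordings.

```python
import numpy as np


def time_to_stand(pelvis_xy, fps, speed_threshold=0.05):
    """Time from the first stand-up effort to a full standing state.

    pelvis_xy: (N, 2) normalized pelvis track from the 2D pose.
    The subject is considered to be rising while the pelvis speed exceeds
    the threshold; the stand-up segment is the first such contiguous run.
    """
    speed = np.linalg.norm(np.gradient(pelvis_xy, axis=0), axis=1) * fps
    moving = speed > speed_threshold

    start = np.argmax(moving)                  # first frame above the threshold
    end = start + np.argmax(~moving[start:])   # first frame back below it
    return (end - start) / fps
```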

The first set of features derived for the walking segment is obtained based on the distance between the two feet, defined as:

d_saw[i] = ‖ s_{2,B_3}^r[i] − s_{2,B_3}^l[i] ‖_2.  (9)

Note that the periodic nature of a normal gait is also reflected in d_saw (see Fig. 9(a)). Given d_saw, we highlight the different W and TU segments in Fig. 9(a). For a gait pattern derived from d_saw, the step time is the time to complete one step and is computed as the time difference between two consecutive local maxima of d_saw. Meanwhile, the step length, defined as the linear distance between two successive placements of the same foot [61], manifests as the local maxima of d_saw. The step width, on the other hand, is interpreted as the local minima of d_saw. These features are not computed within turning segments.
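The sketch below follows this description to extract step length, width, time, and cadence from the foot-distance signal of one walking segment; as above, the peak-finding settings are left at their defaults.

```python
import numpy as np
from scipy.signal import find_peaks


def step_features(right_foot, left_foot, fps):
    """Step statistics from 3D foot keypoints of one walking segment.

    right_foot, left_foot: (N, 3) arrays of the R/L foot keypoints with
    turning frames already removed.
    """
    d_saw = np.linalg.norm(right_foot - left_foot, axis=1)   # Eq. (9)

    maxima, _ = find_peaks(d_saw)    # feet furthest apart -> step length
    minima, _ = find_peaks(-d_saw)   # feet closest together -> step width

    step_time = np.diff(maxima) / fps
    duration = len(d_saw) / fps

    return {
        "step_length_mean": d_saw[maxima].mean(),
        "step_width_mean": d_saw[minima].mean(),
        "step_time_mean": step_time.mean(),
        "cadence": len(maxima) / duration,                   # steps per second
    }
```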

Fig. 9. Examples of SAW features.

As two global features for gait, we report mean and STD of cadence and average speed across all W segments. We compute cadence as the number of steps divided by the duration of a walking segment. Average speed is determined by the total traveled distance of the pelvis joint divided by the duration of a walking segment.

To evaluate the symmetry of the R/L gait, we introduce the cross correlation between the knee angle series of the R/L legs, denoted by S_saw^knee-angle. We find this feature a good descriptor of gait abnormality, as in our recordings gait abnormality is introduced through wearing a knee brace which limits the knee motion (Fig. 3). For each frame, we define the knee angle as the angle between the vectors s_{3,B_3}^r − s_{4,B_3}^r and s_{3,B_3}^r − s_{2,B_3}^r for the right leg, and s_{3,B_3}^l − s_{4,B_3}^l and s_{3,B_3}^l − s_{2,B_3}^l for the left leg. As there is a lag between the R/L gait cycles, we align the knee angle series of the R/L legs within each cycle and then report the CC of the aligned series. Examples of aligned normal and abnormal knee angles for the R/L legs are shown in Fig. 9(b). For normal gait, the R/L knee angles are highly correlated after alignment (Fig. 9(b) top row), while this does not hold for abnormal gait (Fig. 9(b) bottom row).
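The sketch below computes the per-frame knee angle from three 3D keypoints (here named hip, knee, and ankle, which we assume correspond to s_4, s_3, and s_2 in B_3) and scores the R/L symmetry with a simple exhaustive lag search as one possible stand-in for the cycle-wise alignment described above.

```python
import numpy as np


def knee_angle(hip, knee, ankle):
    """Per-frame knee angle (degrees) from (N, 3) hip, knee, ankle keypoints."""
    v1 = hip - knee
    v2 = ankle - knee
    cos_a = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1)
    )
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))


def knee_angle_symmetry(theta_r, theta_l):
    """CC of equal-length R/L knee-angle series after a best-lag alignment."""
    best = -1.0
    for lag in range(len(theta_l)):
        shifted = np.roll(theta_l, lag)   # circular shift compensates the R/L lag
        best = max(best, np.corrcoef(theta_r, shifted)[0, 1])
    return best
```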

In addition, we define the step symmetry between the R/L feet movements by comparing the horizontal positions of the R/L feet at different gait cycles. We represent this metric by S_saw^feet-x. To compute S_saw^feet-x, similar to S_saw^knee-angle, we first align the R/L horizontal positions within each gait stride and report the CC of the aligned series. We report the mean and STD of both S_saw^feet-x and S_saw^knee-angle across different cycles.

VI. Results and Discussion

A. Subject-based Normal Vs. Abnormal Comparison

In this section, we compare the normal and simulated-impaired performances of the same subject and show that this analysis is insensitive to the choice of recording device and robust to the viewpoint or distance from the camera. Note that in our dataset, for each subject, we have four sets of recordings. Two of these recordings capture the normal performance of the test, while in the other two, the subject is asked to perform abnormally. In addition, two pairs of normal/abnormal recordings are captured by an iPhone (P) and an iPad (T). Let NP/NT and AP/AT denote the normal and abnormal recordings captured by iPhone/iPad.

For each feature and subject, we define A-A/N-N as the intra-class distance between the features derived from the abnormal/normal recordings of the subject captured on iPhone and iPad devices. In other words, A-A is the distance between features computed for AT and AP recordings, while N-N marks the difference between the features of NT and NP videos. For N-A, we consider the distance between AT-NP and NT-AP pairs and report the average. We normalize the A-A, N-N, and N-A distances by the maximum of N-A distances.

Fig. 10 illustrates the distribution of A-A, N-N, and N-A distances across 20 different subjects for a subset of features of the FTF test. While the intra-class values are concentrated near zero, the inter-class distances are spread out over a wider range, and the A-A and N-N distances are strictly lower than the N-A distances. The higher concentration of A-A and N-N distances around zero shows that our feature set is robust to minor changes in the viewpoint and is not affected by the recording device. Furthermore, it can be seen as a proof of concept, demonstrating the ability to compare the subject's performance across different time points.

Fig. 10. The inter-class and intra-class distances between some features of normal (N) and abnormal (A) FTF recordings. ♦ denotes the mean value.

B. Abnormality Detection

Principal component analysis (PCA):

The feature set describing normal and abnormal recordings constitutes a high-dimensional vector. For a visual comparison of normal and abnormal recordings in terms of their derived features, we perform dimensionality reduction through PCA. For this purpose, for each test, we concatenate the set of features listed in Table III and normalize them before passing them to PCA. Fig. 11 showcases the results for the different tests. The normal and abnormal recordings are well separated in the dimension-reduced feature space. This implies that our defined features are descriptive and differentiate normal from abnormal recordings well.
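For illustration, this projection amounts to a few lines of scikit-learn; X and y below are placeholders for the per-test feature matrix and normal/abnormal labels rather than the actual DNE data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholders: X holds one row of Table III features per recording,
# y is 0 for normal and 1 for simulated-impaired recordings.
X = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

X_scaled = StandardScaler().fit_transform(X)         # normalize each feature
X_2d = PCA(n_components=2).fit_transform(X_scaled)   # project to two components

# Scatter-plotting X_2d[y == 0] against X_2d[y == 1] reproduces the kind of
# normal/abnormal separation shown in Fig. 11.
```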

Fig. 11. PCA analysis of FT, FTF, FR, and SAW tests. Green crosses and red circles stand for normal and abnormal recordings. All subplots share the same axis.

Abnormal Class Distribution:

In Fig. 12 we compare the distribution of normal versus abnormal features for the FT, FTF, FR, and SAW tests. These plots clearly indicate the difference in distribution between the two classes. Normal features are concentrated in a specific range, whereas the abnormal features are often less regular and have a higher STD.

Fig. 12. Distribution of normal/abnormal features for FT, FTF, FR, and SAW tests plotted in the first, second, third, and last rows. We used kernel density estimation to fit distributions to the data.

Abnormality Detection:

We assess the normal versus abnormal classification performance using our features. To this end, we utilize several machine learning (ML) models grouped into: 1) tree-based methods such as Random Forest (RF), Gradient-Boosting Machine (GBM) [62], and XGBoost [63], and 2) parametric models trained using gradient-descent updates, including Logistic Regression (LR), Support Vector Machine with a radial basis function (RBF) kernel (RSVM), and Multi-layer Perceptron (MLP) with rectified linear unit (ReLU) activations.

We also benchmarked our ML classification performance against two deep learning (DL) baselines. Both DL models predict normal versus abnormal from the major-keypoint pose sequences, unlike the ML-based models which perform classification on the extracted spatio-temporal/kinematic features. In the first DL baseline, we adopt a long short-term memory (LSTM) [64] based sequential model, while in the second DL approach, similar to [65], we use convolutional neural networks (CNN). Details of the ML and DL classification models, data processing, and hyperparameters are provided in Supplementary Section II and Table SI. We evaluate the different models via metrics such as accuracy, average precision, F1 score, and area under the ROC curve (AUC).

We have two splitting schemes to separate the train and test sets. In video-based splitting, videos from all subjects are divided independently based on an 80%/20% train/test split. In addition, to evaluate the performance of the models on unseen patients, the subject-based division scheme splits a portion of the subjects into the train set while keeping the rest in the test set. Thus, videos belonging to the subjects in the train set are not used in the test set and vice versa. In subject-based splitting, we have 16/4 subjects in the train/test sets.

We perform 5-fold cross validation and summarize the average classification performance of all ML and DL models in Table IV. While all models perform well on the various tests, among the ML models RSVM and GBM/XGBoost tend to perform better on most metrics. However, the gap between the performance of the ML models is not significant. This suggests that the extracted set of features distinguishes normal from abnormal samples well.
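A minimal sketch of the subject-based evaluation protocol is shown below, using scikit-learn's GroupKFold so that no subject appears in both the train and test folds; the feature matrix, labels, and subject IDs are placeholders, and the model hyperparameters are illustrative rather than those reported in the supplementary material.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholders: X is the per-recording feature matrix, y the normal (0) /
# abnormal (1) labels, and groups the subject ID of each recording.
X = np.random.rand(90, 30)
y = np.random.randint(0, 2, size=90)
groups = np.repeat(np.arange(18), 5)

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "RSVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True)),
}

for name, model in models.items():
    accs, f1s, aucs = [], [], []
    # Subject-based splitting: recordings of one subject never appear in
    # both the train and the test fold.
    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]
        pred = (prob >= 0.5).astype(int)
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred))
        aucs.append(roc_auc_score(y[test_idx], prob))
    print(f"{name}: acc={np.mean(accs):.3f}  f1={np.mean(f1s):.3f}  auc={np.mean(aucs):.3f}")
```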

TABLE IV.

Classification Performance of Several Machine Learning Models, Including Random Forest (RF), Gradient-Boosting Machine (GBM), XGBoost, Logistic Regression (LR), Support Vector Machine With RBF Kernel (RSVM), and Multi-Layer Perceptron (MLP) Alongside LSTM and CNN Based Deep Learning Models for FT, FTF, FR and SAW Tests

Test | Model | Subject-Based: Acc, Precision, Recall, Specificity, F1 Score, AUC, AP | Video-Based: Acc, Precision, Recall, Specificity, F1 Score, AUC, AP

FT RF 0.8554 0.8947 0.8500 0.8750 0.8500 0.8625 0.8339 0.8773 0.9278 0.8492 0.9236 0.8672 0.8864 0.8643
GBM 0.8804 0.9156 0.8750 0.9000 0.8742 0.8875 0.8602 0.8866 0.9464 0.8470 0.9418 0.8839 0.8944 0.8864
XGBOOST 0.8304 0.8778 0.8250 0.8500 0.8263 0.8375 0.8049 0.8655 0.9206 0.8292 0.9218 0.8481 0.8755 0.8530
LR 0.8679 0.9714 0.7750 0.9750 0.8514 0.8750 0.8732 0.8773 0.9492 0.8292 0.9418 0.8613 0.8855 0.8775
RSVM 0.8679 0.8950 0.8750 0.8750 0.8639 0.8750 0.8428 0.8916 0.9014 0.8914 0.8951 0.8867 0.8932 0.8631
MLP 0.8679 0.9350 0.8250 0.9250 0.8575 0.8750 0.8578 0.8563 0.9300 0.8029 0.9236 0.8199 0.8632 0.8387

LSTM 0.8089 0.8273 0.8250 0.8000 0.8146 0.8125 0.7705 0.9008 0.9492 0.8796 0.9418 0.9029 0.9107 0.9044
CNN 0.8304 0.8273 0.8750 0.7833 0.8474 0.8292 0.8024 0.8916 0.9514 0.8514 0.9418 0.8730 0.8966 0.8852

FTF RF 0.8625 0.9232 0.8250 0.9000 0.8510 0.8625 0.8357 0.9623 0.9550 0.9818 0.9400 0.9666 0.9609 0.9473
GBM 0.9125 0.9378 0.9000 0.9250 0.8993 0.9125 0.8878 0.9895 0.9800 1.0000 0.9800 0.9895 0.9900 0.9800
XGBOOST 0.9250 0.9278 0.9250 0.9250 0.9249 0.9250 0.9028 0.9684 0.9800 0.9636 0.9800 0.9704 0.9718 0.9647
LR 0.8375 0.9378 0.7500 0.9250 0.8004 0.8375 0.8128 0.8930 0.9314 0.8805 0.9200 0.8988 0.9003 0.8853
RSVM 0.8875 0.9378 0.8500 0.9250 0.8708 0.8875 0.8628 0.9579 0.9600 0.9636 0.9600 0.9599 0.9618 0.9447
MLP 0.8625 0.8788 0.8750 0.8500 0.8619 0.8625 0.8218 0.9789 0.9778 0.9778 0.9800 0.9778 0.9789 0.9686

LSTM 0.8875 1.0000 0.7750 1.0000 0.8338 0.8875 0.8875 0.9789 0.9800 0.9818 0.9800 0.9799 0.9809 0.9723
CNN 0.8875 0.9492 0.8250 0.9500 0.8735 0.8875 0.8688 0.9684 1.0000 0.9455 1.0000 0.9705 0.9727 0.9770

FR RF 0.8250 0.9100 0.7500 0.9000 0.8040 0.8250 0.7975 0.8737 0.8656 0.8583 0.8873 0.8551 0.8728 0.8093
GBM 0.8500 0.8878 0.8250 0.8750 0.8360 0.8500 0.8128 0.9033 0.9124 0.8806 0.8936 0.8914 0.8871 0.8543
XGBOOST 0.8500 0.9100 0.8000 0.9000 0.8325 0.8500 0.8225 0.8947 0.9064 0.8583 0.9255 0.8742 0.8919 0.8406
LR 0.8625 0.9500 0.7750 0.9500 0.8414 0.8625 0.8500 0.8947 0.9492 0.8083 0.9618 0.8635 0.8851 0.8513
RSVM 0.9125 0.9278 0.9000 0.9250 0.9097 0.9125 0.8903 0.8717 0.8850 0.8417 0.9055 0.8567 0.8736 0.8221
MLP 0.7875 0.8955 0.7000 0.8750 0.7413 0.7875 0.7580 0.8132 0.8337 0.8000 0.7891 0.8084 0.7945 0.7605

LSTM 0.8875 0.9100 0.8750 0.9000 0.8859 0.8875 0.8600 0.8507 0.8929 0.7667 0.9255 0.8227 0.8461 0.7959
CNN 0.8000 0.8700 0.7250 0.8750 0.7761 0.8000 0.7650 0.9539 1.0000 0.9222 1.0000 0.9568 0.9611 0.9683

SAW RF 0.7877 0.8167 0.7917 0.7946 0.7804 0.7932 0.7542 0.8000 0.8679 0.8429 0.7467 0.8385 0.7948 0.8270
GBM 0.8189 0.9000 0.7917 0.8571 0.8042 0.8244 0.7958 0.8200 0.8406 0.9000 0.6967 0.8561 0.7983 0.8139
XGBOOST 0.8261 0.8250 0.8542 0.8036 0.8240 0.8289 0.7677 0.8200 0.8317 0.9333 0.6300 0.8670 0.7817 0.8106
LR 0.8189 0.8375 0.8542 0.7946 0.8250 0.8244 0.7802 0.7800 0.8762 0.7810 0.7867 0.8097 0.7838 0.8257
RSVM 0.8606 0.8500 0.9375 0.7946 0.8740 0.8661 0.8187 0.8400 0.8929 0.8714 0.7867 0.8685 0.8290 0.8514
MLP 0.8189 0.8500 0.8542 0.7946 0.8240 0.8244 0.7771 0.7800 0.8179 0.8714 0.6467 0.8277 0.7590 0.7860

LSTM 0.7372 0.8333 0.6250 0.8393 0.6778 0.7321 0.6950 0.7800 0.7833 0.8648 0.6700 0.8139 0.7674 0.7541
CNN 0.7877 0.8542 0.7292 0.8393 0.7643 0.7842 0.7452 0.7800 0.8267 0.8076 0.7500 0.8063 0.7788 0.7962

The best and second best results are in bold and underline, respectively.

Furthermore, comparing ML and DL models, we notice that: 1) While DL models perform well on FT, FR and FTF tasks (especially for video-based splitting), they are lagging behind ML models for SAW. We attribute this to the fact that SAW involves more complex motion patterns. Therefore, DL models require larger datasets to be able to learn the classification task from the pose data. 2) DL features extracted from the pose data lack clinical interpretability. 3) For subject-based splitting, ML models operating on the spatio-temporal/kinematic features outperform DL models on most metrics. This indicates better generalization capability of our features on unseen subjects compared to DL models operating on pose data.

C. Feature Importance Analysis

One benefit of tree-based models is their tractable decision-making process. Therefore, we investigate the importance of each feature in the decision process by analyzing our RF models. This analysis gives us the weights of all features, sorted in descending order in Supplementary Fig. S4.
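For reference, a minimal sketch of this analysis with scikit-learn's impurity-based importances is shown below; the feature matrix, labels, and feature names are placeholders for the actual per-test DNE features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholders for the per-test feature table (Table III) and its labels.
X = np.random.rand(90, 5)
y = np.random.randint(0, 2, size=90)
feature_names = ["frequency_asym", "amplitude_mean", "period_std",
                 "max_speed_asym", "wrist_stability"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, sorted in descending order as in Fig. S4.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order:
    print(f"{feature_names[idx]:20s} {rf.feature_importances_[idx]:.3f}")
```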

We notice that the symmetry between specific R/L features for the FT, FTF, and SAW tests is considered the most important, i.e., carries the largest weight. For the SAW test, the most important feature is the similarity between the knee angle time series across different cycles (S_saw^knee-angle), while for FT (Supplementary Fig. S4(a)) and FTF (Supplementary Fig. S4(c)), the features with the largest weights are the frequency asymmetry and the horizontal symmetry (S_ftf^finger-x), respectively. Although this can be attributed to the nature of the simulated impairments in our dataset, it is consistent with clinical practice, where left/right asymmetry is a common biomarker [66]–[68] of different neurological disorders.

Furthermore, temporal and spatial features that characterize the periodic behavior of the movement are important metrics that the decision tree classification models rely on. Examples of these features are amplitude and period for FT, FTF, and FR tests, step length, width, and step time for SAW. We also notice that for a subset of features, having large variations (i.e. STD) across different cycles is another indicator of abnormal performance in our dataset. This is captured in the large weight associated with STD values of some features for various tests. This result also affirms our observations in Fig. 12.

VII. Discussion & Challenges

In this section, we discuss various aspects of DNE including feature design, robustness, clinical relevance and application as well as the current challenges and our proposed solutions.

A. Discussion

Feature Design:

The main goal of our DNE system is to provide an objective tool for quantifying and documenting recordings of neurological tests. Thus, it is critical to design a set of clinically interpretable features that explain the performance of a subject on various motor tasks. In addition, having powerful digital biomarkers reduces the workload of normal-versus-abnormal classification models and improves their generalization, especially when large training datasets are not available. Furthermore, unlike black-box DL models, the explainability of our diverse set of features allows clinicians to better understand and track patients' status over time.

Robustness:

DNE is resilient to slight deviations in camera view, distance to the camera, subject clothing, and mild pixel-intensity changes, owing to the intermediate data standardization and robust pose estimation steps (Sections V-A and V-B). This is experimentally shown by the low intra-class feature distances in Fig. 10. Data normalization and filtering in the pre-processing step also help eliminate noise and errors propagated from the pose estimation module.

In the FT, FR, and SAW tests, the abnormality in the motion is imposed by wearing equipment that is visible in the recordings. The pose estimation models we have used (OP and VP3D) are robust to the appearance of the equipment and can accurately predict the joint locations regardless of its presence. The features incorporated in the classification tasks are derived from the pose data. Therefore, the quantified features and the classification performance are not affected by the visual cues from the equipment.

Clinical Relevance:

In our dataset, the abnormalities in the movements of the subjects were simulated. The simulated impairment in the FR test is the closest to what is observed in clinics for patients with neurological disorders. In the simulated impairment for FR, the arm with no moulage satellites around the weighted wrist, causing a decrease in the orbit frequency (Fig. 8). This is coherent with the clinical observations of patients with neurological impairments.

In the FTF test, the simulated abnormality would be more realistic if the tremor or inaccuracy of movement increased as the finger approached its target (i.e. as the two fingers come together). In our current dataset, the subjects often simulated the tremor throughout their movements, which is only seen in severe cases. In addition, for the FT test, the abnormality is often a combination of decreased amplitude and rate (Fig. 6), and in Parkinson's disease decrements of both. In our DNE dataset, some subjects simulated more of one or the other.

In SAW, the abnormality in real patients appears as a combination of slow time-to-stand, decreased step length, increased step time, and asymmetry of gait features. In our dataset, the abnormality was imposed by wearing a knee brace. Alongside asymmetry between the R/L knee angles, we observed decreased step length for the subjects wearing the knee brace (Fig. 12, SAW). These are in line with clinical observations from real patients.

Overall, features that clinicians observe were disrupted from normal findings to various degrees, although the pattern of disruption of features may have not been exact for a specific condition. We showed that DNE was able to define clinically interpretable features and detect differences between normal and simulated impaired recordings. As future work, to expand its clinical impact, we will focus our analysis on real patients with various neurological impairment severity levels, and with other neurological tests, such as eye movement [69], facial activation [70], [71], or phonation [72].

Clinical Application:

The initial clinical application of DNE is measuring and documenting features of various neurological exams. This would allow for improved communication of objective exam quantification and the ability to assess for changes over time. As future work, with clinicians’ supervision, we will examine and report the performance of DNE on real patients. A longer term goal is to assist clinicians with classification of recordings and provide a platform for longitudinal monitoring of patients.

B. Challenges

Depth Ambiguity:

Analyzing human motion from 2D RGB data requires dealing with the uncertainties associated with the lack of depth information. Furthermore, depth ambiguity becomes a more prominent challenge for the SAW test with frontal-view rather than sagittal-view recordings. It also prevents defining the spatial features in absolute units. Currently, to mitigate these depth uncertainties, for upper-limb tests the subjects are asked to perform the tests while facing the camera and (roughly) parallel to its image plane. In our processing steps, we also perform pose normalizations to compensate for scale variations due to variable distance from the camera. To further address this issue, we believe incorporating LiDAR depth maps, captured by recent iOS devices, in the pose estimation step can prove helpful.

Self-baselining:

Natural motion properties differ across subjects. For example, one subject can be inherently slower or have less strength in performing some tests. In our dataset, we witnessed that some subjects with a slower inherent speed in their normal performance were mistakenly classified as abnormal. This highlights the importance of taking into account the history of a subject and self-baselining. In our experiments, we showcased an example of self-comparisons of normal and abnormal performance of the same subject (Fig. 10). The purpose of this study was to show the ability of our designed features to discriminate between the varying status of the subject at different test times. This result validates the potential of our DNE pipeline as a personalized medical assessment system.

Real-time DNE:

Our current DNE system and the extracted kinematic/spatio-temporal features rely on tracking the human pose from the video recordings in an offline step using off-the-shelf pose estimation modules. Currently, the pose estimation step is the most computationally expensive step, hindering real-time processing and feature extraction. To address this challenge, lighter on-device pose estimation models (with a small sacrifice in accuracy) that focus on extracting the major keypoints rather than the whole body pose are necessary.

VIII. Conclusion

In this paper, we proposed a comprehensive vision-based digital biomarker exam solution named Digitized Neurological Examination (DNE). Using the DNE software, users video-record their performance on various motor tasks, including finger tapping, finger to finger, forearm roll, and stand-up and walk. We introduced the DNE dataset, a total of 375 videos consisting of normal and simulated-impaired performances of 21 subjects on the different tests. For each recording, the 2D/3D pose is estimated and used to quantify kinematic and spatio-temporal features. These features form a set of digital biomarkers that can be 1) accurately obtained from common RGB videos with minimal calibration, and 2) used to track clinical changes across recordings at different time points. On our DNE dataset, we analyzed the effectiveness of the defined features in differentiating normal from simulated-impaired videos within and across subjects. Our results demonstrate high classification accuracy and F1 scores using a variety of machine learning models. Future work will extend the setting of this study to a larger set of subjects with a diverse range of abnormalities.

Contributor Information

Trung-Hieu Hoang, Department of Electrical & Computer Engineering, Coordinated Science Laboratory at University of Illinois at Urbana-Champaign (UIUC), Champaign, IL 61801 USA.

Mona Zehni, Department of Electrical & Computer Engineering, Coordinated Science Laboratory at University of Illinois at Urbana-Champaign (UIUC), Champaign, IL 61801 USA.

Huaijin Xu, Department of Kinesiology & Community Health at UIUC, Urbana, IL 61801 USA.

George Heintz, Healthcare Engineering Systems Center at UIUC, Urbana, IL 61801 USA.

Christopher Zallek, OSF HealthCare Illinois Neurological Institute–Neurology, Peoria, IL 61603 USA.

Minh N. Do, Department of Electrical & Computer Engineering, Coordinated Science Laboratory at University of Illinois at Urbana-Champaign (UIUC), Champaign, IL 61801 USA, and also with the VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam.
