
Table 2.

Summary table of the between-study and within-study findings on the differences in the validity of sensor-derived measurements of motor function across various groups.

Are there differences in the validity of sensor-derived measures of motor function as captured in the ways listed below? For each question, the bulleted entries report the between-study (ie, meta-analytic) findings; the within-study findings follow on a separately labeled line.
Using mass market devices vs medical sensors?
  • No: digital technology vs mass market digital technologies (P=.22); mass market digital technology vs medical devices (P=.21); digital technology vs medical devices (P=.32)

Within-study findings: Insufficient data to evaluate
At specific sensor locations?
  • No: wrist vs ankle (P=.73); wrist vs chest (P=.73); wrist vs hand (P=.54); wrist vs thigh (P=.59); wrist vs back (P=.63); wrist vs pocket (P=.78); wrist vs nonwearable (P=.31)

  • No: ankle vs chest (P=.46); ankle vs hand (P=.38); ankle vs thigh (P=.73); ankle vs waist (P=.60); ankle vs back (P=.49); ankle vs pocket (P=.65); ankle vs nonwearable (P=.58)

  • No: chest vs hand (P=.30); chest vs thigh (P=.39); chest vs waist (P=.70); chest vs back (P=.82); chest vs pocket (P=.50); chest vs nonwearable (P=.89)

  • No: hand vs thigh (P=.58); hand vs waist (P=.75); hand vs back (P=.78); hand vs pocket (P=.42); hand vs nonwearable (P=.53)

  • No: thigh vs waist (P=.86); thigh vs back (P=.73); thigh vs pocket (P=.54); thigh vs nonwearable (P=.40)

  • No: waist vs back (P=.87); waist vs pocket (P=.39); waist vs nonwearable (P=.24)

  • No: back vs pocket (P=.45); back vs nonwearable (P=.48); pocket vs nonwearable (P=.50)

Within-study findings: Insufficient data to evaluate
At home vs in the laboratory?
  • No; P=.33

Within-study findings: No; one study found AUC^a values of 0.76 (when administered at home) vs 0.83 (when administered in the clinic) [59]. A second study found slightly higher accuracy, sensitivity, and specificity when the task was completed at home [87].
In longitudinal vs cross-sectional studies?
  • No; P=.29

Within-study findings: No; one study found high Pearson r validity coefficients (r>0.50) for over 40 distinct motion outcomes but very low validity coefficients for a handful, including deflection range roll (measured in degrees), mean sway velocity roll (measured in degrees per second), and up-down deviation (measured in centimeters) [69]. A second study found Pearson r validity coefficients above 0.50 for variables related to steps taken, distance, and speed, but coefficients below 0.50 for variables related to angles (eg, trunk, hips, ankle, upper limb, and full body) [78]. A third study found Pearson r validity coefficients above 0.50 for gait, arising from chair, body bradykinesia, hypokinesia, and overall posture, and validity coefficients below 0.50 for rigidity of the lower and upper extremities, axial rigidity, postural stability, leg agility, and tremors in the lower or upper extremities [98].
In healthy vs motor-impaired patients?
  • Yes; validity higher among healthy adults, z score 3.19, P=.001

Within-study findings: Insufficient data to evaluate
Using different feature detection algorithms?
  • Insufficient data to evaluate

Within-study findings: No; one study detected movement best using random forests, relative to support vector machines and naïve Bayes [55]. A second study found that both neural networks and boosting outperformed support vector machines and Fisher linear discriminant analysis [90]. A third study found that neural networks performed better than other algorithms, including random forest, multilayer perceptron, decision tree, support vector machine, and naïve Bayes [64]. A fourth study found that support vector machines performed better than logistic regression and decision trees [80]. A fifth study found that random forests based on Ridge regression outperformed those based on Lasso or Gini impurity, and that linear support vector machines outperformed logistic regression and boosting [103]. The sole consistent pattern that emerged was that supervised machine learning techniques performed better than unsupervised techniques (eg, naïve Bayes).
Using particular motion sensor signal types?
  • Insufficient data to evaluate

Within-study findings: Insufficient data to evaluate
Using all vs a subset of features?
  • Insufficient data to evaluate

Within-study findings: No; one study found AUC values >0.90 based on 998 detected features, with a drop to 0.75 when based on the top 30 features [49]. A second study concluded that “Accuracies obtained using the 30 most salient features were broadly comparable with the corresponding sensitivity and specificity values obtained using all 998 features” [42]. (A minimal code sketch illustrating this all-vs-subset contrast, alongside the algorithm comparison above, follows the table.)
With the thresholds held constant across patients vs patient-specific thresholds?
  • No; P=.48

Within-study findings: No; although algorithm training typically occurred across a sample, several studies took the approach of starting the algorithm (feature detection) using data from all participants but then allowing each patient to vary in later stages, such as feature selection or threshold determination [34,54,63,68]. Validity estimates from this smaller group of studies were similar in magnitude to those from studies that applied the same features and thresholds to the classification of all participants.
Using clinically supervised vs nonsupervised assessments of patient clinical status?
  • No; P=.16

Within-study findings: Insufficient data to evaluate
With outliers trimmed vs retained in the feature detection stage?
  • Yes; trimming outliers is beneficial, z score 2.10, P=.04

Within-study findings: Insufficient data to evaluate
With transformed data vs untransformed data?
  • No; P=.74

Within-study findings: Insufficient data to evaluate
With standardized data vs unstandardized data?
  • No; P=.60

Within-study findings: Insufficient data to evaluate

^a AUC: area under the curve.
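The classifier comparisons and the all-features vs top-30-features contrast summarized above can be illustrated with a brief sketch. The code below is a minimal, hypothetical example using scikit-learn and synthetic data; the specific models, the 200 synthetic features, and the univariate F-test selector are assumptions made for illustration and do not reproduce any reviewed study's pipeline.

```python
# Minimal illustrative sketch (synthetic data), not any study's actual pipeline:
# compares cross-validated AUC across classifiers and across all features vs the
# 30 most salient features, mirroring the contrasts summarized in the table.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a sensor-derived feature matrix (eg, gait features).
X, y = make_classification(n_samples=300, n_features=200, n_informative=30,
                           random_state=0)

# Compare classification algorithms by cross-validated AUC.
models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "linear SVM": SVC(kernel="linear", random_state=0),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.2f}")

# Compare using all features vs the 30 most salient features (univariate F test).
rf = RandomForestClassifier(n_estimators=200, random_state=0)
auc_all = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
auc_top30 = cross_val_score(make_pipeline(SelectKBest(f_classif, k=30), rf),
                            X, y, cv=5, scoring="roc_auc").mean()
print(f"all features: AUC = {auc_all:.2f}; top 30 features: AUC = {auc_top30:.2f}")
```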