Supplementary Figures 1-5 Legends

Extended Data Fig. 1 Estimation of the scorer effect in clinical FI items. A, The effect of tester varies across FI items. B, The estimated random effect across 4 scorers in the data set.

Extended Data Fig. 2 Detailed modeling analysis. A, The distribution of age across 643 data points (533 mice). The distribution of manual FIadj scores across 643 data points (533 mice). B, To determine the contributions of frailty parameters in predicting Age, we calculated the feature importance of all frailty parameters. We found that gait disorders, kyphosis and piloerection have the highest contributions. C, The random forest regression model performed better than other models with the lowest root-mean-squared error (RMSE) (n = 50 independent train-test splits, p […] 0.95. G, Residuals versus index and predicted versus true values for the training set (Column 1; residual standard error = 8.5, difference in slopes (black vs gray) = 0.11) and the test set (Column 2; residual standard error = 15.87, difference in slopes (black vs gray) = 0.30) for the model that predicts Age from frailty index items. H, I, Out-of-bag (OOB) error-based 95% prediction intervals (PIs) (gray lines) quantifying uncertainty in point estimates/predictions (gray dots). There is one interval per test mouse (n = 107 unique mice; the test data contain some repeats of the same mice tested at different ages) and approximately 95% of the PIs contain the correct Age (red dots) and FI scores (blue dots). We ordered the x-axis (Test set index) in ascending order (from left to right) of the actual age/FI. The average PI width across all test mice's predicted FI scores is 0.18 ± 0.04 (resp. 71.96 ± 18.52 for the predicted Age), while the PI widths range from 0.08 to 0.29 (resp. 28 to 113 for Age).
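The coverage and width summaries reported for the prediction intervals in panels H and I can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `summarize_intervals` and the toy numbers are hypothetical, and the intervals are assumed to be already available as (lower, upper) pairs rather than derived here from out-of-bag errors.

```python
from statistics import mean, stdev


def summarize_intervals(intervals, truths):
    """Return empirical coverage and width statistics for prediction intervals.

    intervals: list of (lower, upper) pairs, one per test observation
    truths:    list of the corresponding true values
    """
    if len(intervals) != len(truths):
        raise ValueError("intervals and truths must have the same length")

    # Width of each interval (upper bound minus lower bound).
    widths = [upper - lower for lower, upper in intervals]

    # An observation is "covered" when its true value lies inside its interval.
    covered = [lower <= t <= upper for (lower, upper), t in zip(intervals, truths)]

    return {
        "coverage": sum(covered) / len(covered),
        "mean_width": mean(widths),
        "sd_width": stdev(widths) if len(widths) > 1 else 0.0,
        "min_width": min(widths),
        "max_width": max(widths),
    }


# Hypothetical toy data (NOT values from the study).
toy_intervals = [(0.10, 0.30), (0.20, 0.35), (0.25, 0.45), (0.30, 0.40)]
toy_truths = [0.20, 0.30, 0.50, 0.35]
summary = summarize_intervals(toy_intervals, toy_truths)
```

For well-calibrated 95% intervals, the `coverage` entry would be expected to sit near 0.95 on a sufficiently large test set; in the toy data it is lower because one of the four true values falls outside its interval.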
In (C, D and E), the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles), respectively, and the line in the middle corresponds to the median. The upper (lower) whisker extends from the upper (lower) hinge to the largest (smallest) value no further than 1.5 × IQR from the hinge, where IQR is the interquartile range.

Extended Data Fig. 3 Correlation between video metrics. A, Correlation between average/mean (x-axis) and median (y-axis) video gait metrics. The diagonal line corresponds to maximum correlation, i.e., 1. B, Correlation between interquartile range (IQR, x-axis) and standard deviation (Stdev, y-axis) video gait metrics. The diagonal line corresponds to maximum correlation, i.e., 1. Tight clustering of points around the diagonal line indicates a high correlation between mean and median, or between IQR and standard deviation, for the respective metric.

Extended Data Fig. 4 Test for Simpson's paradox. A, Simpson (1951) showed that the statistical relationship observed in a population could be reversed within all of the subgroups that make up that population, leading to erroneous conclusions drawn from the population data. To test for the manifestation of Simpson's paradox in our data, we split the bimodal Age distribution into two separate unimodal distributions (clusters), that is, less than 70 weeks old (L70, red) versus more than 70 weeks old (U70, blue). Next, we plotted the dependent variable (frailty) against each of the independent variables/features in our data and fit a simple linear regression model to each subgroup separately (solid red and blue lines) as well as to the aggregate data (black dotted line). B, We quantified the correlations by measuring the slope of the linear fits of the features (Y) on Age (X). We computed the slopes for L70, U70 and overall (All), then plotted the slopes for features in decreasing order of their relevance to the model (where we predict Age from these features).
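The subgroup-versus-aggregate slope comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`ols_slope`, `subgroup_slopes`, `shows_reversal`) and the toy data are hypothetical, and an age exactly equal to the cutoff is assigned to the upper group because the legend does not specify that case.

```python
def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def subgroup_slopes(age, feature, cutoff=70):
    """Fit feature (Y) on age (X) for L70, U70 and the aggregate data."""
    low = [(a, f) for a, f in zip(age, feature) if a < cutoff]
    # Assumption: ages equal to the cutoff go to the upper group.
    high = [(a, f) for a, f in zip(age, feature) if a >= cutoff]
    return {
        "L70": ols_slope([a for a, _ in low], [f for _, f in low]),
        "U70": ols_slope([a for a, _ in high], [f for _, f in high]),
        "All": ols_slope(age, feature),
    }


def shows_reversal(slopes):
    """True when both subgroup slopes share a sign opposite to the overall slope."""
    low, high, overall = slopes["L70"], slopes["U70"], slopes["All"]
    same_sign_subgroups = low * high > 0
    return same_sign_subgroups and low * overall < 0


# Hypothetical toy data (NOT from the study), built to exhibit the paradox:
# the feature falls with age inside each group but rises in the pooled data.
toy_age = [10, 20, 30, 80, 90, 100]
toy_feature = [5, 4, 3, 10, 9, 8]
toy_slopes = subgroup_slopes(toy_age, toy_feature)
toy_reversal = shows_reversal(toy_slopes)
```

A feature for which `shows_reversal` returns `False` shows no sign flip between the subgroups and the aggregate data. A sign check alone is a coarse screen and does not replace a formal test of slope differences.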
We went further and performed a one-way ANOVA to test for differences in slopes between the L70 and U70 subgroups and the overall data (one-way ANOVA, F(2,141) = 1.162, p > 0.32). Next, we performed false discovery rate-adjusted post hoc pairwise comparisons using t-tests. We found no significant differences in the comparisons (L70 versus U70, p = 0.38; L70 versus All, p = 0.77; and U70 versus All, p = 0.38). We found that Simpson's paradox does not manifest in any of the top fifteen features in our data.

Extended Data Fig. 5 Further experiments to test model performance and parameters. A, We compare the performance of different feature sets, 1) age alone, 2) video and 3) age + video, in predicting frailty across n = 50 independent train-test splits. We use age alone as a feature in a linear model (AgeL) and a generalized additive non-linear model (AgeG). Although we did not notice a clear improvement of the random forest model using video features (VideoRF) over a vFI prediction based on age alone, a clear improvement in prediction performance is seen for the model (AllRF), which contains video features + age with lowest MSE (p