PLoS One. 2020 May 18;15(5):e0233125. doi: 10.1371/journal.pone.0233125

Table 4. Psychometric properties of identified standardized on-road evaluation instruments.

On-road test Validity Reliability
Performance-Based Driving Evaluation (PBDE) (Odenheimer et al., 1994)
Content validity: Participation of experts in driving assessment and rehabilitation to define the tasks to include in the test. Participation of driving experts and functional and cognitive assessment experts to develop the procedures, the domains to be tested, and the score sheets, and to pretest the road test.
Inter-rater reliability: Correlational analyses among the raters' total driving scores. Closed road: high reliability (r = 0.84). Open road: high reliability (r = 0.74).
Internal consistency: Closed road: acceptable internal consistency (α = 0.78). Open road: acceptable internal consistency (α = 0.89).
Criterion validity: Strong significant positive correlation between the instructor's global score and the open-road score (r = 0.74; p<0.01). Weak to moderate significant positive correlation between the instructor's global score (the criterion) and the closed-road score (r = 0.44; p<0.05), suggesting that the closed course is not an adequate method to assess on-road performance. Moderate significant positive correlation between the closed-road score and the open-road score (r = 0.60; p<0.01).
Construct validity: Weak significant negative correlation between age and the instructor's evaluation (global score) (r = -0.48; p<0.01). The PBDE (open road) correlated strongly and positively with the MMSE (r = 0.72; p<0.01), strongly and negatively with complex reaction-time tasks (r = -0.70; p<0.01), moderately and negatively with age (r = -0.57; p<0.01), and moderately and positively with a traffic-sign recognition test (r = 0.65; p<0.01), Trail Making Test part A (r = 0.52; p<0.01), visual memory (r = 0.54; p<0.01), and verbal memory (r = 0.51; p<0.01).
Washington University Road Test (WURT) (Hunt et al., 1997)
Criterion validity: Criterion: instructor's global rating. Moderate significant positive correlation (τ-b = 0.60; p<0.001). Marginal to moderate significant positive correlations between the 9 subscores and the instructor's global rating (τ-b: 0.26–0.69; p<0.01).
Inter-rater reliability: Tested with n = 10 participants. Almost perfect reliability between the instructor (global score) and the principal investigator (WURT) (κ = 0.85); almost perfect reliability between the two investigators (κ = 0.96).
Construct validity: Marginal significant negative correlation between the CDR and the WURT (τ-b = -0.27; p<0.001): the more advanced the dementia, the poorer the driving performance, and vice versa (in line with the hypothesis).
Test-retest reliability: Tested with n = 63 participants. Stability of the instructor's global score at one month (0.53, statistic unspecified); stability of the instructor's quantitative score at one month (0.76, statistic unspecified).
New Haven (Richardson & Marottoli, 2003)
Internal consistency: Acceptable internal consistency (α = 0.88).
Inter-rater reliability: Tested with n = 357 participants. Two evaluators alternated positions (front vs. back seat) and independently assessed the participants with the 36-item scale. For the scale: excellent reliability (ICC = 0.99). For the items: almost perfect reliability for 26 items (κ = 0.911–0.998) and for the remaining 10 (κ > 0.80).
Construct validity: Significant partial correlations (p<0.05), controlling for distance vision, between the road-test score and visual attention (r = 0.43), executive functions (r = -0.38), and visual memory (r = 0.40).
Test Ride for Investigating Practical Fitness to Drive: Belgian Version (TRIP)
Criterion validity: Criterion: "pass" or "fail" category defined by the Stroke Driver Screening Assessment (SDSA), with the CARA assessor's global rating as comparator: 78.9% agreement. Comparison of the "pass" or "fail" result between the CARA assessor and the state-registered evaluator: 81.6% agreement. High significant positive correlation between the TRIP global ratings and the state-registered evaluator's evaluation (r = 0.80; p<0.001). Comparison between the judgements (proportion of "pass" or "fail" participants) of the CARA assessor and those of the state-registered evaluator (global ratings): sensitivity of 80.6% and specificity of 100% (Akinwuntan et al., 2005).
Inter-rater reliability: 3 databases: (1) the 27 real-life performance assessments by A (n = 17) and B (n = 10) (CARA), (2) the 27 video recordings by A (n = 10) and B (n = 17) (CARA), and (3) the 27 video recordings by C (external assessor).
1 vs. 2 (real-life performance and videos). Subitems: level of agreement of 80% or more, except for 5 subitems. Items: weak to good reliability for 9/17 items (ICC: 0.42–0.85). Closed-road score: moderate reliability (ICC = 0.70). Open-road score: moderate reliability (ICC = 0.56). Global rating: moderate reliability (ICC = 0.62; 0.64 after excluding non-reliable items).
2 vs. 3 (videos). Subitems: level of agreement of 80% or more, except for 3 subitems. Items: weak to excellent reliability for 13/17 items (ICC: 0.42–1.0). Closed-road score: moderate reliability (ICC = 0.58). Open-road score: good reliability (ICC = 0.77). Global rating: good reliability (ICC = 0.80; 0.84 after excluding non-reliable items) (Akinwuntan et al., 2003). Subitems (ordinal scale): weighted κ: 0.44–0.78; 32 subitems have moderate to good reliability (ICC: 0.61–0.80). Items (sum of subitems): moderate to good reliability (ICC: 0.63–0.87). Global rating: good reliability (ICC = 0.83) (Akinwuntan et al., 2005).
Rhode Island Road Test (RIRT)
Structural validity: Factor analysis: homogeneous cluster of 21 items related to driving awareness (ICC = 0.40) that explains 31% of the variance in the scale, with (too) high internal consistency (α = 0.93). 3 items related to stopping and parking were uninformative (items 5, 27, 28). Second cluster of 4 items related to speed control (items 3, 4, 13, 21) (ICC = 0.45) that explains 8% of the variance in the scale, with acceptable internal consistency (α = 0.80) (Ott et al., 2012).
Inter-rater reliability: Assessed with n = 20 participants. Almost perfect agreement for the global rating (linear weighted ratings: κ = 0.83; quadratic weighted ratings: κ = 0.92). High positive correlation of the average RIRT score between the 2 assessors (r = 0.87) (Brown et al., 2005).
Sum of Maneuvers Score (SMS)
Internal consistency: (Too) high internal consistency (α = 0.94), suggesting that the SMS effectively measures a single construct, driving performance; unidimensionality not explored (Justiss et al., 2006).
Test-retest reliability: Interim period of one week. With dichotomous scoring for each maneuver: excellent reliability (ICC = 0.91). With a score based on the 4-point scale: excellent reliability (ICC = 0.95). Low influence of the assessor's position on reliability (Justiss et al., 2006).
Criterion validity: Criterion: global rating; significant very high positive correlation between the global rating and the SMS score (r = 0.84; p<0.001) (Justiss et al., 2006). Criterion: GRS score; ROC analysis (AUC = 0.906) with a cut-off score at 230, sensitivity = 0.91 and specificity = 0.87 (Shechtman et al., 2010).
Inter-rater reliability: With dichotomous scoring for each maneuver: good reliability (ICC = 0.88). With a score based on the 4-point scale: excellent reliability (ICC = 0.94); reliability is better with a more detailed scale that considers error severity. For the GRS: excellent reliability (ICC = 0.98) (Justiss et al., 2006).
Performance Analysis of Driving Ability (P-Drive)
Structural validity: 3/27 items non-compliant with the Rasch model's expectations (one generates outliers, one is over-predictable, and the last is non-compliant and needs revision). PCA: the principal component explains 59.1% (>50%) of the variance and the second component explains 4.9% (<5%); results suggest unidimensionality of the scale (Patomella et al., 2010). PCA: the principal component explains 80.3% of the variance (>60%) and the variance unexplained by the first contrast is 2.4% (<5%); results suggest unidimensionality of the scale (Patomella & Bundy, 2015).
Inter-rater reliability: Excellent inter-rater reliability: random-effects intraclass correlation coefficient ICC = 0.950 (95% CI: 0.889–0.978). Good to excellent reliability for each category (ICC: 0.875–0.963) (Vaucher et al., 2015). With a dichotomous score (pass vs. marginal or fail): moderate reliability (κ = 0.45).
Measurement invariance: Differential item functioning: 3 items were more difficult for people with CVA vs. people with MCI, and one item was more difficult for people with MCI vs. people with CVA, suggesting possible differential item functioning between diagnoses (Patomella et al., 2010).
Criterion validity: Criterion: subjective expert evaluation. Optimal cut-off score at 85 (sensitivity and specificity figures not reported in the article, but a graph is available). Significant marginal positive correlation between participants' self-ratings and the P-Drive evaluation (ρ = 0.24; p = 0.046) (Selander et al., 2011). Significant association between P-Drive and the instructors' evaluations (R2 = 0.445; p = 0.021) (Vaucher et al., 2015). Criterion: global medical evaluation. ROC analysis: optimal cut-off score at 81, with a sensitivity of 0.93, a specificity of 0.92, and an AUC of 0.98; PPV = 0.95 and NPV = 0.90 (Patomella & Bundy, 2015).
Person separation reliability: Person separation reliability coefficient = 0.9 (>0.7): P-Drive separates the drivers into 4 strata (person separation index = 3.06) (Patomella et al., 2010). Person separation reliability coefficient = 0.92, indicating that P-Drive separates drivers into 4 strata (Patomella & Bundy, 2015).
Person response validity: 11 participants (5%) did not demonstrate good fit to the model; MnSq < 0.6 for 5/11 (a weak threat to person response validity). Results suggest acceptable person response validity (Patomella et al., 2010). 96% of the data from the occupational therapists were within the acceptable range of goodness-of-fit, supporting good person response validity (Patomella & Bundy, 2015).
Composite Driving Assessment Scale (CDAS) (Ott et al., 2012)
Structural validity: Factor analysis: homogeneous cluster of 20 items (ICC = 0.40) that explains 14% of the variance in the scale, with acceptable internal consistency (α = 0.89). Second homogeneous cluster of 4 items (items 4, 14, 18, 25) (ICC = 0.39) that explains 12% of the variance in the scale, with acceptable internal consistency (α = 0.73). 4 items (items 11, 12, 19, 26) are uninformative.
Nottingham Neurological Driving Assessment (NNDA) (Lincoln et al., 2012)
Validity: ND
Inter-rater reliability: Perfect agreement on the overall decisions. Level of agreement for the items: 100% for 7/25 items. 13/25 items: discrepancies between ratings of minor errors and no error (safety not compromised in either case). 5/25 items: discrepancies between ratings of correct or minor errors and major errors (safety compromised). Overall, discrepancies between the assessors' judgements on 6/150 observations (4%).
Driving Observation Schedule (DOS) (Vlahodimitrakou et al., 2013)
Face validity: A post-drive survey and GPS analysis suggest that the participants' performance on the DOS is representative of their everyday driving, supporting the face validity of the DOS.
Inter-rater reliability: Excellent reliability between observers: ICC = 0.91 (95% CI: 0.747–0.965; p<0.0001) and a significant, high positive correlation (r = 0.83; p<0.05).
Ecological validity: Comparison between the DOS trips and participants' everyday driving using a GPS over 4 months. Significant difference (p<0.0001) in distance and duration. Majority of time spent on 50 and 60 km/h roads during both everyday driving and the DOS route. Significant difference (p<0.05) in time spent on 50 km/h roads (DOS > everyday) and 80 km/h roads (DOS < everyday). No difference in time spent driving on 40, 60, 70, 90 and 100 km/h roads. Overall, the GPS data support the similarity between the DOS trips and participants' everyday driving, and thus the ecological validity of the DOS.
Measurement error: SEM = 3%; ME = 2.9%; CV = 3.3%. These small values of SEM, ME and CV suggest a high level of absolute reliability.
Record of Driving Errors (RODE) (Barco et al., 2015)
Validity: ND
Inter-rater reliability: Good to excellent reliability for the main categories of driving errors, e.g., closed route (ICC = 0.84), low traffic (ICC = 0.90), and moderate to high traffic (ICC = 0.97). For total operational errors: ICC = 0.91; for total tactical errors: ICC = 0.95. However, low reliability for some driving errors, explainable by their lower frequency of occurrence; for example, errors in pedal control: ICC = -0.02.
Western University (UWO) on-road assessment
Face validity: Development of the course and extraction of the main on-road components from on-road studies, with the involvement of a certified driving rehabilitation specialist and a certified driving instructor; these data support the face validity of the UWO assessment (Classen et al., 2016a).
Inter-rater reliability: Near-perfect agreement between the driving rehabilitation specialist and the driving instructor for the GRS with two levels (κ = 0.892; p<0.0001) and with four levels (κ = 0.952; p<0.0001). Near-perfect agreement for the PERS for the first type of errors (κ = 0.888; p<0.0001), the second (κ = 0.847; p<0.0001), and the third (κ = 0.902; p<0.0001) (Classen et al., 2016b).
Content validity: Use of a content validity matrix indicating the level of agreement (content validity index) between each source (the main studies on course development) and the UWO on-road course components. Several drives on the roadways to refine the course. Excellent content validity (100% agreement between the UWO on-road course and the on-road components documented in the literature) (Classen et al., 2016a).
Construct validity: Tested with the known-groups method. Good construct validity: more people with MS in the fail group, and more severe forms of MS (progressive vs. relapsing-remitting) in that group compared with the pass group (Classen et al., 2016a).

PCA: principal component analysis; ρ: Spearman correlation coefficient; p: p-value; R2: coefficient of determination (square of the Pearson coefficient); ICC: intraclass correlation coefficient; CI: confidence interval; PPV: positive predictive value; NPV: negative predictive value; MnSq: mean-square; ROC: receiver operating characteristic; AUC: area under the curve; CVA: cerebrovascular accident; MCI: mild cognitive impairment; κ: Cohen’s kappa; α: Cronbach’s alpha; GRS: global rating scale; PERS: priority error rating score; MS: multiple sclerosis; ND: no data; r: Pearson correlation coefficient; τ-b: Kendall rank correlation coefficient; SEM: standard error of measurement; ME: method error; CV: coefficient of variation; CARA: Center for Determination of Fitness to Drive and Car Adaptations
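Many of the internal-consistency figures in the table (e.g., α = 0.78 to 0.94) are Cronbach's alpha values. The sketch below shows the standard formula applied to a respondents-by-items score matrix; the data are invented for illustration and do not come from any of the cited studies.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per respondent,
    one column per item. Uses the usual k/(k-1) * (1 - sum(item var)/total var)."""
    n_items = len(scores[0])

    def variance(values):  # sample variance (n - 1 denominator)
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings for 3 drivers on 2 items (not study data):
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

Values around 0.7 to 0.9 are conventionally read as acceptable, which is how the table's α entries are labeled; values well above 0.9 (as for the RIRT and SMS) can signal item redundancy, hence the "(too) high" qualifier.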
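Several inter-rater agreement figures above are Cohen's kappa values (e.g., κ = 0.85 to 0.96 for the WURT). A minimal sketch of unweighted kappa for two raters follows; the pass/fail ratings in the example are hypothetical, not from the studies.

```python
def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters' categorical ratings:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical decisions by two evaluators on four drivers:
kappa = cohen_kappa(["pass", "pass", "fail", "fail"],
                    ["pass", "pass", "fail", "pass"])
```

The verbal labels used in the table follow the usual convention: κ > 0.80 "almost perfect", 0.61 to 0.80 "substantial/good", 0.41 to 0.60 "moderate" (as for the P-Drive dichotomous score, κ = 0.45).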
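The criterion-validity entries report sensitivity, specificity, PPV and NPV for pass/fail decisions against a criterion (e.g., the P-Drive cut-off of 81 with sensitivity 0.93 and specificity 0.92). These all derive from a 2x2 confusion matrix. The sketch below treats "fail" as the positive class and uses invented decisions, not study data.

```python
def diagnostic_metrics(predicted_fail, criterion_fail):
    """Sensitivity, specificity, PPV, NPV from paired boolean decisions,
    with 'fail' (True) treated as the positive class."""
    pairs = list(zip(predicted_fail, criterion_fail))
    tp = sum(p and c for p, c in pairs)            # failed by both
    tn = sum((not p) and (not c) for p, c in pairs)  # passed by both
    fp = sum(p and (not c) for p, c in pairs)      # test fails, criterion passes
    fn = sum((not p) and c for p, c in pairs)      # test passes, criterion fails
    return {
        "sensitivity": tp / (tp + fn),  # failing drivers correctly failed
        "specificity": tn / (tn + fp),  # passing drivers correctly passed
        "ppv": tp / (tp + fp),          # how trustworthy a "fail" verdict is
        "npv": tn / (tn + fn),          # how trustworthy a "pass" verdict is
    }
```

A ROC analysis, as used for the SMS and P-Drive, simply repeats this computation across candidate cut-off scores and picks the one with the best sensitivity/specificity trade-off; the AUC summarizes performance over all cut-offs.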
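The absolute-reliability indices reported for the DOS (SEM, CV) relate a score's spread to its reliability. A common formulation is SEM = SD x sqrt(1 - r) and CV = SD / mean; this is a sketch of those textbook definitions, not necessarily the exact computation used by Vlahodimitrakou et al.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM: expected spread of observed scores around the true score,
    given the score SD and a reliability coefficient (e.g., an ICC)."""
    return sd * math.sqrt(1.0 - reliability)

def coefficient_of_variation(values):
    """CV: sample standard deviation expressed as a fraction of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd / mean

# Hypothetical: repeated DOS scores with SD = 10 and test-retest ICC = 0.91
sem = standard_error_of_measurement(10, 0.91)
```

Small SEM and CV values, like the 3% figures in the table, mean that repeated measurements of the same driver cluster tightly, i.e., high absolute reliability.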