Table 1:
Criterion | Adequate | Good | Excellent
---|---|---|---
Norms | M and SD for total score (and subscores if relevant) from a large, relevant clinical sample | M and SD for total score (and subscores if relevant) from multiple large, relevant samples, at least one clinical and one nonclinical | Same as “good,” but must be from representative sample (i.e., random sampling, or matching to census data) |
Internal consistency (Cronbach’s alpha, split-half, etc.) | Most evidence shows alpha values of 0.70–0.79 | Most reported alphas 0.80–0.89 | Most reported alphas ≥0.90
Inter-rater reliability | Most evidence shows kappas of 0.60–0.74, or ICCs of 0.70–0.79 | Most reported kappas of 0.75–0.84, or ICCs of 0.80–0.89 | Most kappas ≥0.85, or ICCs ≥0.90
Test–retest reliability (stability) | Most evidence shows test–retest correlations ≥0.70 over period of several days or weeks | Most evidence shows test–retest correlations ≥0.70 over period of several months | Most evidence shows test–retest correlations ≥0.70 over 1 year or longer |
Repeatability | Bland–Altman (Bland & Altman, 1986) plots show small bias and/or weak trends; coefficient of repeatability is tolerable compared to clinical benchmarks (Vaz, Falkmer, Passmore, Parsons, & Andreou, 2013) | Bland–Altman plots and corresponding regressions show no significant bias and no significant trends; coefficient of repeatability is tolerable | Bland–Altman plots and corresponding regressions show no significant bias and no significant trends; established across multiple studies; coefficient of repeatability is small enough that it is not clinically concerning
Content validity | Test developers clearly defined domain and ensured representation of entire set of facets | Same as “adequate,” plus all elements (items, instructions) evaluated by judges (experts or pilot participants) | Same as “good,” plus multiple groups of judges and quantitative ratings |
Construct validity (e.g., predictive, concurrent, convergent, and discriminant validity) | Some independently replicated evidence of construct validity | Bulk of independently replicated evidence shows multiple aspects of construct validity | Same as “good,” plus evidence of incremental validity with respect to other clinical data |
Discriminative validity | Statistically significant discrimination in multiple samples; AUCs <0.60 under clinically realistic conditions (i.e., not comparing treatment-seeking and healthy youth) | AUCs of 0.60 to <0.75 under clinically realistic conditions | AUCs of 0.75 to 0.90 under clinically realistic conditions
Prescriptive validity | Statistically significant accuracy at identifying a diagnosis with a well-specified matching intervention, or statistically significant moderator of treatment | Same as “adequate,” with good kappa for diagnosis, or significant treatment moderation in more than one sample | Same as “good,” with good kappa for diagnosis in more than one sample, or moderate effect size for treatment moderation |
Validity generalization | Some evidence supports use with either more than one specific demographic group or in more than one setting | Bulk of evidence supports use with either more than one specific demographic group or in multiple settings | Bulk of evidence supports use with more than one specific demographic group AND in multiple settings
Treatment sensitivity | Some evidence of sensitivity to change over course of treatment | Independent replications show evidence of sensitivity to change over course of treatment | Same as “good,” plus sensitive to change across different types of treatments |
Clinical utility | After practical considerations (e.g., costs, respondent burden, ease of administration and scoring, availability of relevant benchmark scores, patient acceptability), assessment data are likely to be clinically actionable | Same as “adequate,” plus published evidence that using the assessment data confers clinical benefit (e.g., better outcome, lower attrition, greater satisfaction), in areas important to stakeholders | Same as “good,” plus independent replication |
Note: ICC = intraclass correlation coefficient; AUC = area under the curve. Table reproduced with permission.
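To make the reliability benchmarks in Table 1 concrete, the sketch below shows minimal, self-contained computations of two of the statistics named there: Cronbach’s alpha (internal consistency) and the Bland–Altman bias and coefficient of repeatability for paired measurements. This is an illustrative implementation under common formulas (alpha from item and total-score variances; coefficient of repeatability as 1.96 × SD of the paired differences), not code from the source; the function names and toy data are assumptions for the example.

```python
# Illustrative sketch (not from the source): Cronbach's alpha and
# Bland-Altman repeatability statistics, using only the standard library.
from statistics import mean, variance, stdev

def cronbach_alpha(rows):
    """Cronbach's alpha for `rows`, a respondents-by-items score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])                       # number of items
    items = list(zip(*rows))               # transpose to per-item columns
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var / total_var)

def bland_altman(a, b):
    """Bias and coefficient of repeatability for paired measurements.

    Bias is the mean difference; CR is taken as 1.96 * SD of the
    differences (one common convention following Bland & Altman, 1986).
    """
    d = [x - y for x, y in zip(a, b)]
    return mean(d), 1.96 * stdev(d)

if __name__ == "__main__":
    # Hypothetical data: 5 respondents answering a 3-item scale.
    scores = [[3, 4, 3], [2, 2, 3], [4, 5, 4], [1, 2, 2], [3, 3, 4]]
    print("alpha:", round(cronbach_alpha(scores), 2))

    # Hypothetical test-retest pairs for 4 respondents.
    bias, cr = bland_altman([10, 12, 11, 13], [9, 12, 12, 12])
    print("bias:", round(bias, 2), "CR:", round(cr, 2))
```

Reading the results against the table: an alpha of 0.92 would fall in the “excellent” band for internal consistency, while judging the coefficient of repeatability still requires a clinical benchmark, as the table notes.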