Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: J Clin Child Adolesc Psychol. 2020 Sep 11;50(1):58–76. doi: 10.1080/15374416.2020.1802736

Table 1:

Rubric for evaluating norms, validity, and utility (De Los Reyes & Langer, 2018)

| Criterion | Adequate | Good | Excellent |
| --- | --- | --- | --- |
| Norms | M and SD for total score (and subscores if relevant) from a large, relevant clinical sample | M and SD for total score (and subscores if relevant) from multiple large, relevant samples, at least one clinical and one nonclinical | Same as "good," but must be from a representative sample (i.e., random sampling, or matching to census data) |
| Internal consistency (Cronbach's alpha, split-half, etc.) | Most evidence shows alpha values of 0.70–0.79 | Most reported alphas 0.80–0.89 | Most reported alphas ≥0.90 |
| Inter-rater reliability | Most evidence shows kappas of 0.60–0.74, or ICCs of 0.70–0.79 | Most reported kappas 0.75–0.84, or ICCs 0.80–0.89 | Most kappas ≥0.85, or ICCs ≥0.90 |
| Test–retest reliability (stability) | Most evidence shows test–retest correlations ≥0.70 over a period of several days or weeks | Most evidence shows test–retest correlations ≥0.70 over a period of several months | Most evidence shows test–retest correlations ≥0.70 over 1 year or longer |
| Repeatability | Bland–Altman (Bland & Altman, 1986) plots show small bias and/or weak trends; coefficient of repeatability is tolerable compared to clinical benchmarks (Vaz, Falkmer, Passmore, Parsons, & Andreou, 2013) | Bland–Altman plots and corresponding regressions show no significant bias and no significant trends; coefficient of repeatability is tolerable | Same as "good," plus established across multiple studies; coefficient of repeatability is small enough that it is not clinically concerning |
| Content validity | Test developers clearly defined the domain and ensured representation of the entire set of facets | Same as "adequate," plus all elements (items, instructions) evaluated by judges (experts or pilot participants) | Same as "good," plus multiple groups of judges and quantitative ratings |
| Construct validity (e.g., predictive, concurrent, convergent, and discriminant validity) | Some independently replicated evidence of construct validity | Bulk of independently replicated evidence shows multiple aspects of construct validity | Same as "good," plus evidence of incremental validity with respect to other clinical data |
| Discriminative validity | Statistically significant discrimination in multiple samples; AUCs <0.60 under clinically realistic conditions (i.e., not comparing treatment-seeking and healthy youth) | AUCs of 0.60 to <0.75 under clinically realistic conditions | AUCs of 0.75 to 0.90 under clinically realistic conditions |
| Prescriptive validity | Statistically significant accuracy at identifying a diagnosis with a well-specified matching intervention, or statistically significant moderator of treatment | Same as "adequate," with good kappa for diagnosis, or significant treatment moderation in more than one sample | Same as "good," with good kappa for diagnosis in more than one sample, or moderate effect size for treatment moderation |
| Validity generalization | Some evidence supports use with more than one specific demographic group, or in more than one setting | Bulk of evidence supports use with more than one specific demographic group, or in multiple settings | Bulk of evidence supports use both with more than one specific demographic group AND in multiple settings |
| Treatment sensitivity | Some evidence of sensitivity to change over the course of treatment | Independent replications show evidence of sensitivity to change over the course of treatment | Same as "good," plus sensitive to change across different types of treatment |
| Clinical utility | After practical considerations (e.g., costs, respondent burden, ease of administration and scoring, availability of relevant benchmark scores, patient acceptability), assessment data are likely to be clinically actionable | Same as "adequate," plus published evidence that using the assessment data confers clinical benefit (e.g., better outcome, lower attrition, greater satisfaction) in areas important to stakeholders | Same as "good," plus independent replication |

Note: ICC = intraclass correlation coefficient; AUC = area under the curve. Table reproduced with permission.
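The internal-consistency thresholds above refer to Cronbach's alpha, which can be computed from item-level scores as k/(k−1) × (1 − Σ item variances / total-score variance). A minimal sketch (not from the source article; the scale data are hypothetical and purely illustrative):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha from k item-score columns.

    items: list of k lists, each holding one item's scores across respondents.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_var = sum(pvariance(col) for col in items)   # sum of item variances
    total_var = pvariance(totals)                     # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical 4-item scale, 6 respondents (illustrative data only).
scale = [
    [3, 4, 2, 5, 4, 3],
    [3, 5, 2, 4, 4, 3],
    [2, 4, 3, 5, 5, 2],
    [3, 4, 2, 5, 4, 4],
]
alpha = cronbach_alpha(scale)
```

By the rubric, an alpha in this range would fall in the "good" to "excellent" bands (≥0.80).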
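The inter-rater reliability row benchmarks Cohen's kappa, which corrects observed agreement for chance agreement: κ = (p_o − p_e) / (1 − p_e). A sketch with hypothetical ratings (the category labels and data below are invented for illustration):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportion of agreement.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal proportions.
    pe = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe)

# Hypothetical diagnostic calls from two raters (illustrative only).
rater_a = ["dx", "dx", "none", "dx", "none", "none", "dx", "none"]
rater_b = ["dx", "none", "none", "dx", "none", "none", "dx", "dx"]
kappa = cohens_kappa(rater_a, rater_b)
```

A kappa of this size would fall below the rubric's "adequate" band (0.60–0.74), illustrating how chance correction can penalize raw agreement.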
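The repeatability row relies on Bland–Altman bias and the coefficient of repeatability. One common operationalization (assumed here; conventions vary across studies) takes bias as the mean test–retest difference and CR as 1.96 × SD of the differences; the scores below are hypothetical:

```python
from statistics import mean, stdev

def bland_altman(test, retest):
    """Bland-Altman bias and coefficient of repeatability for paired scores."""
    diffs = [a - b for a, b in zip(test, retest)]
    bias = mean(diffs)            # systematic shift between administrations
    cr = 1.96 * stdev(diffs)      # coefficient of repeatability (one convention)
    return bias, cr

# Hypothetical test/retest total scores (illustrative only).
t1 = [20, 25, 31, 18, 27, 22]
t2 = [21, 24, 30, 19, 28, 21]
bias, cr = bland_altman(t1, t2)
```

Whether a given CR is "tolerable" is judged against clinical benchmarks, not a fixed statistical cutoff, which is why the rubric phrases this row qualitatively.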
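The discriminative-validity bands are stated in AUC terms. The AUC equals the probability that a randomly drawn case scores above a randomly drawn non-case, and can be computed nonparametrically as the Mann–Whitney U statistic divided by n₁·n₂ (ties counted as half). A sketch with hypothetical screener totals:

```python
def auc(cases, controls):
    """Nonparametric AUC: P(case score > control score), ties count 1/2."""
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical screener totals for clinical vs. comparison youth (illustrative).
clinical = [14, 18, 11, 20, 16]
comparison = [9, 12, 10, 15, 8]
a = auc(clinical, comparison)
```

An AUC of 0.88 would sit in the rubric's "excellent" band (0.75 to 0.90), provided the comparison reflects clinically realistic conditions rather than treatment-seeking versus healthy youth.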