Content
Definition: Test items are relevant to and representative of the intended construct.
Example: Using expert opinions to ensure all domains are accurately represented.
Scoring rubric:
N = No discussion of instrument content (simply listing items without justification counts as no discussion).
0 = Discussion but no data.
1 = Listing assessment themes with little or no reference to a theoretical basis, or a poorly defined process for creating and reviewing items.
2 = Well-defined process for developing instrument content, including both an explicit theoretical/conceptual basis for instrument items and systematic item review by experts; alternatively, reference to a prior study on an assessment instrument that meets these criteria.
Response process
Definition: The thought processes and actions of subjects and observers accord with the intended construct.
Example: Quality control of assessments, such as standardizing test administration, minimizing examiner bias, and providing a specific test for the task.
Scoring rubric:
N = No discussion. Merely disclosing response rates or numbers of respondents does not constitute evidence.
0 = Discussion but no data. Discussing the impact of response rate on assessment scores, or speculating on the thought processes of learners, does not constitute evidence.
1 = Minimal data regarding thought processes and analysis of responses; description (without data) of systems that reduce response error, such as computer-scored forms.
2 = Multiple sources of supportive data, including critical examination of thought processes, analysis of responses for evidence of halo error or rater leniency, or data demonstrating low response error.
Internal structure
Definition: Test scores across tasks can be reliably reproduced.
Example: Calculating interitem reliability and test-retest reliability.
Scoring rubric:
N = No discussion.
0 = Discussion but no data.
1 = Factor analysis incompletely confirming the anticipated data structure, or acceptable reliability with a single measure.
2 = Factor analysis confirming the anticipated data structure, or multiple measures of reliability. Variation in responses to specific items among subgroups (differential item functioning) can support or challenge internal structure, depending on predictions.
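The reliability analyses named above can be made concrete with a short computation. The sketch below is a minimal illustration, assuming an examinees-by-items score matrix; the function names and data are illustrative, not taken from the rubric. It estimates interitem reliability with Cronbach's alpha and test-retest reliability with the correlation between two administrations.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def test_retest_r(scores_t1: np.ndarray, scores_t2: np.ndarray) -> float:
    """Pearson correlation between two administrations of the same test."""
    return float(np.corrcoef(scores_t1, scores_t2)[0, 1])

# Illustrative data only: 5 examinees x 4 items, each scored 1-5
rng = np.random.default_rng(0)
items = rng.integers(1, 6, size=(5, 4)).astype(float)
print(cronbach_alpha(items))
print(test_retest_r(items.sum(axis=1), items.sum(axis=1) + rng.normal(0, 1, 5)))
```

Reporting both an internal-consistency coefficient and a stability coefficient of this kind is one way a study could provide the "multiple measures of reliability" required for a score of 2.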
Relations to other variables
Definition: Test scores correlate with external, independent measures with which they share a theoretical relationship.
Example: Comparing scores between groups with different levels of experience in the tested skill.
Scoring rubric:
N = No discussion.
0 = Discussion but no data.
1 = Correlation of assessment scores with outcomes of minimal theoretical importance, or unanticipated score correlations.
2 = Correlation (convergence) or no correlation (divergence) between assessment scores and theoretically predicted outcomes or measures of the same construct. Such evidence will usually be integral to the study design and anticipated a priori.
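As a minimal sketch of the group-comparison and convergence analyses this rubric describes (all variable names and data below are illustrative assumptions, not findings), assessment scores can be compared between experience groups and correlated with a theoretically related external measure:

```python
import numpy as np
from scipy import stats

# Illustrative data: checklist scores for novice and experienced clinicians,
# plus an external measure (years of experience) for each subject.
novice_scores = np.array([12.0, 15.0, 14.0, 11.0, 13.0])
expert_scores = np.array([18.0, 20.0, 17.0, 19.0, 21.0])

# Group-difference form: do scores discriminate between groups predicted
# a priori to differ in the tested skill?
t, p = stats.ttest_ind(expert_scores, novice_scores, equal_var=False)

# Convergence form: do scores correlate with a theoretically related measure?
years = np.array([1, 2, 1, 3, 2, 10, 12, 8, 9, 15])
scores = np.concatenate([novice_scores, expert_scores])
r, p_r = stats.pearsonr(years, scores)

print(f"expert-novice difference: t = {t:.2f}, p = {p:.3f}")
print(f"score vs. experience: r = {r:.2f}, p = {p_r:.3f}")
```

Whether such results count as a 2 depends on the direction of the relationship having been predicted before the data were collected, not merely observed afterward.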
Consequences
Definition: The impact of using the assessment.
Example: Determining the pass-fail score, and the consequences for the subject of obtaining a pass or fail, promotion, or privilege.
Scoring rubric:
N = No discussion. Speculation on potential applications of the assessment does not constitute evidence.
0 = Discussion but no data. Simply discussing the consequences of assessment (eg, data regarding usefulness or faculty approval) without linking this to validity does not constitute evidence.
1 = Description of consequences of assessment that could conceivably affect the validity of score interpretations, although these impacts are not explicitly identified by the authors.
2 = Description of consequences of assessment that clearly affect the validity of score interpretations, as supported by data and convincingly argued by the authors. Such evidence will usually be integral to the study design and anticipated a priori.