Content
Definition: Test items are relevant to and representative of the intended construct.
Example: Using expert opinions to ensure all domains are accurately represented.
Scoring rubric:
N = No discussion of instrument content (simply listing items without justification counts as no discussion).
0 = Discussion but no data.
1 = Listing assessment themes with little or no reference to a theoretical basis, or a poorly defined process for creating and reviewing items.
2 = Well-defined process for developing instrument content, including both an explicit theoretical/conceptual basis for instrument items and systematic item review by experts; alternatively, reference to a prior study on an assessment instrument that meets these criteria.
Response process
Definition: The thought processes and actions of subjects and observers accord with the intended construct.
Example: Quality control of assessments, such as standardizing test administration, minimizing examiner bias, and providing a specific test for the task.
Scoring rubric:
N = No discussion. Merely disclosing response rates or numbers of respondents does not constitute evidence.
0 = Discussion but no data. Discussing the impact of response rate on assessment scores, or speculating on the thought processes of learners, does not constitute evidence.
1 = Minimal data regarding thought processes and analysis of responses; description (without data) of systems that reduce response error, such as computer-scored forms.
2 = Multiple sources of supportive data, including critical examination of thought processes, analysis of responses for evidence of halo error or rater leniency, or data demonstrating low response error.
Internal structure
Definition: Test scores across tasks can be reliably reproduced.
Example: Calculating interitem reliability and test-retest reliability.
Scoring rubric:
N = No discussion.
0 = Discussion but no data.
1 = Factor analysis incompletely confirming the anticipated data structure, or acceptable reliability with a single measure.
2 = Factor analysis confirming the anticipated data structure, or multiple measures of reliability. Variation in responses to specific items among subgroups (differential item functioning) can support or challenge internal structure, depending on predictions.
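The reliability analyses named above can be made concrete with a short computation. The sketch below is a minimal illustration, assuming an examinees-by-items score matrix; the function names and data are illustrative, not taken from the rubric. It estimates interitem reliability with Cronbach's alpha and test-retest reliability with the correlation between two administrations.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) score matrix."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def test_retest_r(scores_t1: np.ndarray, scores_t2: np.ndarray) -> float:
    """Pearson correlation between two administrations of the same test."""
    return float(np.corrcoef(scores_t1, scores_t2)[0, 1])

# Illustrative data only: 5 examinees x 4 items, each scored 1-5
rng = np.random.default_rng(0)
items = rng.integers(1, 6, size=(5, 4)).astype(float)
print(cronbach_alpha(items))
print(test_retest_r(items.sum(axis=1), items.sum(axis=1) + rng.normal(0, 1, 5)))
```

Reporting both an internal-consistency coefficient and a stability coefficient of this kind is one way a study could provide the "multiple measures of reliability" required for a score of 2.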
Relations to other variables
Definition: Test scores correlate with external, independent measures with which they share a theoretical relationship.
Example: Comparing scores between groups with different levels of experience in the tested skill.
Scoring rubric:
N = No discussion.
0 = Discussion but no data.
1 = Correlation of assessment scores with outcomes of minimal theoretical importance, or unanticipated score correlations.
2 = Correlation (convergence) or no correlation (divergence) between assessment scores and theoretically predicted outcomes or measures of the same construct. Such evidence will usually be integral to the study design and anticipated a priori.
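As a minimal sketch of the group-comparison and convergence analyses this rubric describes (all variable names and data below are illustrative assumptions, not findings), assessment scores can be compared between experience groups and correlated with a theoretically related external measure:

```python
import numpy as np
from scipy import stats

# Illustrative data: checklist scores for novice and experienced clinicians,
# plus an external measure (years of experience) for each subject.
novice_scores = np.array([12.0, 15.0, 14.0, 11.0, 13.0])
expert_scores = np.array([18.0, 20.0, 17.0, 19.0, 21.0])

# Group-difference form: do scores discriminate between groups predicted
# a priori to differ in the tested skill?
t, p = stats.ttest_ind(expert_scores, novice_scores, equal_var=False)

# Convergence form: do scores correlate with a theoretically related measure?
years = np.array([1, 2, 1, 3, 2, 10, 12, 8, 9, 15])
scores = np.concatenate([novice_scores, expert_scores])
r, p_r = stats.pearsonr(years, scores)

print(f"expert-novice difference: t = {t:.2f}, p = {p:.3f}")
print(f"score vs. experience: r = {r:.2f}, p = {p_r:.3f}")
```

Whether such results count as a 2 depends on the direction of the relationship having been predicted before the data were collected, not merely observed afterward.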
Consequences
Definition: The impact of using the assessment.
Example: Determining the pass-fail score, and the consequences for the subject of obtaining a pass or fail, promotion, or privilege.
Scoring rubric:
N = No discussion. Speculation on potential applications of the assessment does not constitute evidence.
0 = Discussion but no data. Simply discussing the consequences of assessment (eg, data regarding usefulness or faculty approval) without linking this to validity does not constitute evidence.
1 = Description of consequences of assessment that could conceivably affect the validity of score interpretations, although these impacts are not explicitly identified by the authors.
2 = Description of consequences of assessment that clearly affect the validity of score interpretations, as supported by data and convincingly argued by the authors. Such evidence will usually be integral to the study design and anticipated a priori.