2023 Jun 21;53(10):1905–1929. doi: 10.1007/s40279-023-01867-4

Table 2.

Types of validity evidence according to the Standards for Educational and Psychological Testing (American Educational Research Association [1]) and how each was applied in this review

Type of validity evidence	Explanation of evidence	Applied in this review
Content evidence	Whether the assessment content (scenarios, questions, response options, and instructions) reflects the intended construct. This might be based on prior instruments, expert review, and/or a particular framework/model	Considered as partial evidence if only one aspect was performed (e.g., a Delphi survey but not a literature review). The literature review did not have to be published separately; evidence that it was performed was sufficient
Response process evidence	Refers to analyses that evaluate how well the raters' (or responders') responses align with the intended construct, including analysis of the thoughts or actions of responders/raters during the assessment	Evidence of responses from the intended population had to be reported to be considered as supporting evidence
Internal structure evidence	Refers to data that evaluate the relationships among assessment items and how these relate to the overall construct of interest. This could be measures of reproducibility (reliability) but can also include analysis of items and factors (such as construct validity)	Considered as partial evidence if only one aspect was provided (e.g., an aspect of reliability but no evidence for construct validity)
Relationships with other variables evidence	Concerns statistical associations between assessment scores and other measures that have a specified theoretical relationship. This type of validity can be termed criterion related and includes concurrent, predictive, convergent, and discriminant validity	For our purposes, this could include reporting the relationship between physical literacy and: age (would expect a positive association); sex (boys higher in motor skills); motor skills (where a physical literacy instrument has a motor skill component); physical literacy as measured by another instrument (would expect a positive association); and over time
Consequences evidence	Concerns the impact of the assessment itself and any decisions and actions that result (e.g., remediation following a below-expected performance), as well as differences in scores among subgroups where performances ought to be similar	This could include factors that influence such decisions, such as development of a cut-off score to indicate poor physical literacy (e.g., at what point can this be determined?)