2022 Mar 1;27(2):405–425. doi: 10.1007/s10459-022-10092-z

Table 3.

AIG validity assessment

Inferences		(Gierl et al., 2012a, b)	(Gierl & Lai, 2013a)	(Gierl & Lai, 2013b)	(Gierl et al., 2016)	(Gierl & Lai, 2016)	(Gierl & Lai, 2018)	(Lai et al., 2016a, b)	(Pugh et al., 2016)	(Pugh et al., 2020)	(Shappell et al., 2020)
Proposed use of AIG		Generate MCQs for medical licensure testing.	Generate MCQs for medical licensure testing.	Generate MCQs for medical licensure testing.	Generate MCQs for medical licensure testing.	Generate MCQs for medical assessment.	Generate MCQs and rationales for medical formative testing.	Generate MCQs and distractors for medical licensure testing.	Generate MCQs for medical assessment.	Generate MCQs for medical assessment.	Generate MCQs for medical mastery learning assessment.
Scoring	Existing evidence	Cognitive and item models were developed and reviewed by specialists.	Items were blindly evaluated for quality by a panel of experts.	Cognitive and item models were developed and reviewed by specialists.	Cognitive and item models were developed and reviewed by specialists.	Experts evaluated the content and the logic specified in the cognitive model and in the item model.	Experts blindly reviewed the rationales generated for formative testing.	Cognitive and item models were developed and reviewed by specialists.	Cognitive and item models were developed and reviewed by specialists.	Quality of items generated was evaluated by experts.	Item models were developed and reviewed by specialists.
Generalisation	Existing evidence	UN	UN	UN	Item response theory was used, but not reported. CTT was used. Generated items measured a broad range of difficulty levels.	UN	UN	CTT was used. Generated items measured a broad range of difficulty levels.	UN	UN	No significant differences in item difficulty between tests were found.
Extrapolation	Existing evidence	UN	UN	UN	Consistent levels of item discrimination.	UN	UN	Consistent levels of item discrimination.	UN	UN	No significant differences in mean item discrimination between tests were found.
Implications	Existing evidence	UN	UN	UN	UN	UN	UN	UN	UN	UN	UN

UN: Unclear / Unreported