Table 3.
AIG validity assessment
| Inferences | | (Gierl et al., 2012a, b) | (Gierl & Lai, 2013a) | (Gierl & Lai, 2013b) | (Gierl et al., 2016) | (Gierl & Lai, 2016) | (Gierl & Lai, 2018) | (Lai et al., 2016a, b) | (Pugh et al., 2016) | (Pugh et al., 2020) | (Shappell et al., 2020) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed use of AIG | | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical licensure testing. | Generate MCQs for medical assessment. | Generate MCQs and rationales for medical formative testing. | Generate MCQs and distractors for medical licensure testing. | Generate MCQs for medical assessment. | Generate MCQs for medical assessment. | Generate MCQs for medical mastery learning assessment. |
| Scoring | Existing evidence | Cognitive and item models were developed and reviewed by specialists. | Items were blindly evaluated for quality by a panel of experts. | Cognitive and item models were developed and reviewed by specialists. | Cognitive and item models were developed and reviewed by specialists. | Experts evaluated the content and the logic specified in the cognitive model and in the item model. | Experts blindly reviewed the rationales generated for formative testing. | Cognitive and item models were developed and reviewed by specialists. | Cognitive and item models were developed and reviewed by specialists. | Quality of items generated was evaluated by experts. | Item models were developed and reviewed by specialists. |
| Generalisation | Existing evidence | UN | UN | UN | Item response theory was used, but not reported. CTT was used. Generated items measured a broad range of difficulty levels. | UN | UN | CTT was used. Generated items measured a broad range of difficulty levels. | UN | UN | No significant differences in item difficulty between tests were found. |
| Extrapolation | Existing evidence | UN | UN | UN | Consistent levels of item discrimination. | UN | UN | Consistent levels of item discrimination. | UN | UN | No significant differences in mean item discrimination between tests were found. |
| Implications | Existing evidence | UN | UN | UN | UN | UN | UN | UN | UN | UN | UN |
*UN - Unclear / Unreported