Table 4.
Included records that had the highest METRICS score per item.
| Item | Issues considered in each item | Excellent or very good reporting examples |
| --- | --- | --- |
| #1 Model | What is the model of the generative AI<sup>a</sup> tool used for generating content, and what are the exact settings for each tool? | Baglivo et al [47]: Bing, ChatGPT, Chatsonic, Bard, and YouChat, with full details of the mode and large language model, including plugins |
| #2 Evaluation | What is the exact approach used to evaluate the content generated by the generative AI-based model, and is it an objective or subjective evaluation? | Al-Ashwal et al [42]: Objective via 2 different clinical reference tools; Alfertshofer et al [43]: Objective based on key answers, with the questions screened independently by 4 investigators; Ali et al [44]: Objective for multiple-choice questions and true-or-false questions, and subjective for short-answer and essay questions; Aljindan et al [45]: Objective based on key answers and historical performance metrics; and Baglivo et al [47]: Objective based on key answers and comparison with fifth-year medical students’ performance |
| #3a Timing | When is the generative AI model tested exactly and what are the duration and timing of testing? | Baglivo et al [47]; Biswas et al [48]; Fuchs et al [51]; Ghosh & Bir [52]; Hoch et al [58]; Juhi et al [59]; Kumari et al [61]; Kung et al [62]; Oca et al [66]; Pugliese et al [68]; Sallam et al [69]; Wang et al [73]; and Zhou et al [75] |
| #3b Transparency | How transparent are the data sources used to generate queries for the generative AI-based model? | Alfertshofer et al [43] |
| #4a Range | What is the range of topics tested and are they intersubject or intrasubject with variability in different subjects? | Ali et al [44]; Chen et al [49]; Hoch et al [58]; and Wang et al [73] |
| #4b Randomization | Was the process of selecting the topics to be tested on the generative AI-based model randomized? | Alfertshofer et al [43] and Aljindan et al [45] |
| #5 Individual | Is there any individual subjective involvement in generative AI content evaluation? If so, did the authors describe the details in full? | Ali et al [44] and Moise et al [65] |
| #6 Count | What is the count of queries executed (sample size)? | Alfertshofer et al [43]; Chen et al [49]; Guerra et al [56]; Hoch et al [58]; and Oztermeli & Oztermeli [67] |
| #7 Specificity of the prompt or language | How specific are the exact prompts used? Were those exact prompts provided fully? Did the authors consider the feedback and learning loops? How specific are the language and cultural issues considered in the generative AI model? | Alfertshofer et al [43]; Biswas et al [48]; Fuchs et al [51]; Grewal et al [55]; Moise et al [65]; Pugliese et al [68]; and Wang et al [73] |
<sup>a</sup>AI: artificial intelligence.