2024 Feb 15;13:e54704. doi: 10.2196/54704

Table 4.

Included records that had the highest METRICS score per item.

Item	Issues considered in each item	Excellent or very good reporting examples
#1 Model	What is the model of the generative AI^a tool used for generating content, and what are the exact settings for each tool?	Baglivo et al [47]: Bing, ChatGPT, Chatsonic, Bard, and YouChat, with full details of the mode and large language model, including plugins
#2 Evaluation	What is the exact approach used to evaluate the content generated by the generative AI-based model and is it an objective or subjective evaluation?	Al-Ashwal et al [42]: Objective via 2 different clinical reference tools; Alfertshofer et al [43]: Objective based on key answers, with the questions screened independently by 4 investigators; Ali et al [44]: Objective for multiple-choice questions and true or false questions, and subjective for short-answer and essay questions; Aljindan et al [45]: Objective based on key answers and historical performance metrics; and Baglivo et al [47]: Objective based on key answers and comparison with 5th-year medical students’ performance
#3a Timing	When is the generative AI model tested exactly and what are the duration and timing of testing?	Baglivo et al [47]; Biswas et al [48]; Fuchs et al [51]; Ghosh & Bir [52]; Hoch et al [58]; Juhi et al [59]; Kumari et al [61]; Kung et al [62]; Oca et al [66]; Pugliese et al [68]; Sallam et al [69]; Wang et al [73]; and Zhou et al [75]
#3b Transparency	How transparent are the data sources used to generate queries for the generative AI-based model?	Alfertshofer et al [43]
#4a Range	What is the range of topics tested and are they intersubject or intrasubject with variability in different subjects?	Ali et al [44]; Chen et al [49]; Hoch et al [58]; and Wang et al [73]
#4b Randomization	Was the process of selecting the topics to be tested on the generative AI-based model randomized?	Alfertshofer et al [43] and Aljindan et al [45]
#5 Individual	Is there any individual subjective involvement in generative AI content evaluation? If so, did the authors describe the details in full?	Ali et al [44] and Moise et al [65]
#6 Count	What is the count of queries executed (sample size)?	Alfertshofer et al [43]; Chen et al [49]; Guerra et al [56]; Hoch et al [58]; and Oztermeli & Oztermeli [67]
#7 Specificity of the prompt or language	How specific are the exact prompts used? Were those exact prompts provided fully? Did the authors consider the feedback and learning loops? How specific are the language and cultural issues considered in the generative AI model?	Alfertshofer et al [43]; Biswas et al [48]; Fuchs et al [51]; Grewal et al [55]; Wang et al [73]; Moise et al [65]; and Pugliese et al [68]

^aAI: artificial intelligence.