2024 Feb 15;13:e54704. doi: 10.2196/54704

Table 2.

Classification based on the evaluation approach of the content generated by the artificial intelligence–based models.

Authors | Evaluation of performance | Individual role and interrater reliability
Al-Ashwal et al [42] | Objective, via 2 different clinical reference tools | Not applicable
Alfertshofer et al [43] | Objective, based on the key answers, with the questions screened independently by 4 investigators | Not applicable
Ali et al [44] | Objective for multiple-choice and true-or-false questions; subjective for short-answer and essay questions | Assessment by 2 assessors independently, with the intraclass correlation coefficient used to measure agreement
Aljindan et al [45] | Objective, based on key answers and historical performance metrics | Not applicable
Altamimi et al [46] | Subjective | Not clear; assessment for accuracy, informativeness, and accessibility by clinical toxicologists and emergency medicine physicians
Baglivo et al [47] | Objective, based on key answers and comparison with 5th-year medical students' performance | Not applicable
Biswas et al [48] | Subjective, by a 5-member team of optometry teaching and expert staff with over 100 years of combined clinical and academic experience; independent evaluation on a 5-point Likert scale ranging from very poor to very good | The median scores across raters for each response were studied; the median score represented rater consensus, while the score variance represented disagreement between the raters
Chen et al [49] | Objective, based on key answers | Not applicable
Deiana et al [50] | Subjective, based on qualitative assessment of correctness, clarity, and exhaustiveness; each response rated on a 4-point Likert scale from strongly disagree to strongly agree | Independent assessment by 2 raters with experience in vaccination and health communication topics
Fuchs et al [51] | Objective, based on key answers | Not applicable
Ghosh & Bir [52] | Objective, based on key answers, with subjectivity introduced by the raters' assessment | Scoring by 2 assessors on a scale of 0 to 5 (0=incorrect, 5=fully correct), based on a preselected answer key
Giannos [53] | Objective, based on key answers | Not applicable
Gobira et al [54] | Objective, based on key answers, with an element of subjectivity through classifying responses as adequate, inadequate, or indeterminate | Two raters independently scored accuracy; after the individual evaluations, the raters performed a third assessment to reach consensus on questions with differing results
Grewal et al [55] | Not clear | Not clear
Guerra et al [56] | Subjective, through comparison with the results of a previous study on the average performance of users and a cohort of medical students and neurosurgery residents | Not applicable
Hamed et al [57] | Subjective | Not clear
Hoch et al [58] | Objective, based on key answers | Not applicable
Juhi et al [59] | Subjective, using Stockley's Drug Interactions Pocket Companion 2015 as a reference key | Two raters reached a consensus for categorizing the output
Kuang et al [60] | Subjective | Not clear
Kumari et al [61] | Subjective; content validity checked by 2 experts in curriculum design | Three independent raters scored content for correctness, with an accuracy score ranging from 1 to 5
Kung et al [62] | Objective, based on key answers | Not applicable
Lai et al [63] | Objective, based on key answers | Not applicable
Lyu et al [64] | Subjective | Two experienced radiologists (with 21 and 8 years of experience) evaluated the quality of the ChatGPT responses
Moise et al [65] | Subjective, through comparison with the latest American Academy of Otolaryngology–Head and Neck Surgery Foundation Clinical Practice Guideline: Tympanostomy Tubes in Children (Update) | Two independent raters evaluated the output; interrater reliability was assessed using the Cohen κ test; to confirm consensus, responses were reviewed by the senior author
Oca et al [66] | Not clear | Not clear
Oztermeli & Oztermeli [67] | Objective, based on key answers | Not applicable
Pugliese et al [68] | Subjective, using a Likert scale for accuracy, completeness, and comprehensiveness | Multirater: 10 key opinion leaders in nonalcoholic fatty liver disease and 1 nonphysician with expertise in patient advocacy in liver disease independently rated the AI^a content
Sallam et al [69] | Subjective, based on correctness, clarity, and conciseness | Fleiss multirater κ
Seth et al [70] | Subjective, through comparison with current health care guidelines for rhinoplasty, plus evaluation by a panel of plastic surgeons using a Likert scale to assess readability, text complexity, and the education level required for understanding, and the modified DISCERN^b score | Not clear
Suthar et al [71] | Subjective, by 3 fellowship-trained neuroradiologists using a 5-point Likert scale, with 1 indicating "highly improbable" and 5 indicating "highly probable" | Not applicable
Walker et al [72] | Modified EQIP^c Tool, with comparison against the UK National Institute for Health and Care Excellence guidelines for gallstone disease, pancreatitis, liver cirrhosis, or portal hypertension and the European Association for the Study of the Liver guidelines | All answers were assessed by 2 authors independently, with contradictory results resolved by consensus; the process was repeated 3 times per EQIP item; wrong or out-of-context answers, known as "AI hallucinations," were recorded
Wang et al [73] | Subjective | Not clear
Wang et al [74] | Objective, based on key answers | Not applicable
Zhou et al [75] | Subjective | Not clear

^a AI: artificial intelligence.

^b DISCERN: an instrument for judging the quality of written consumer health information on treatment choices.

^c EQIP: Ensuring Quality Information for Patients.
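Several of the studies above report chance-corrected agreement statistics (the Cohen κ test, the Fleiss multirater κ, or the intraclass correlation coefficient) when 2 or more raters independently scored the AI output. As a minimal illustration of what such a statistic captures, the sketch below computes Cohen's κ for two raters from scratch; the rating lists are hypothetical and are not taken from any of the cited studies.

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal label frequencies.
    """
    assert len(rater1) == len(rater2) and rater1, "need paired, nonempty ratings"
    n = len(rater1)
    labels = set(rater1) | set(rater2)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: product of marginal frequencies, summed over labels.
    p_e = sum((rater1.count(l) / n) * (rater2.count(l) / n) for l in labels)

    if p_e == 1.0:  # both raters use a single identical label; kappa undefined
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: 2 raters classifying 10 AI responses as
# correct (1) or incorrect (0). They agree on 8 of 10 items.
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
r2 = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
print(round(cohen_kappa(r1, r2), 3))  # moderate agreement, below the raw 0.8
```

Note that κ is deliberately lower than the raw 80% agreement because both raters would agree on some items by chance alone; this correction is why κ (rather than percentage agreement) is the statistic reported in the studies above.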