Table 2.
Classification of the included studies based on the approach used to evaluate the content generated by the artificial intelligence–based models.
| Authors | Evaluation of performance | Individual role and interrater reliability |
| --- | --- | --- |
| Al-Ashwal et al [42] | Objective via 2 different clinical reference tools | Not applicable |
| Alfertshofer et al [43] | Objective based on the key answers, with the questions screened independently by 4 investigators | Not applicable |
| Ali et al [44] | Objective for multiple-choice questions and true or false questions, and subjective for short-answer and essay questions | Assessment by 2 assessors independently with intraclass correlation coefficient for agreement |
| Aljindan et al [45] | Objective based on key answers and historical performance metrics | Not applicable |
| Altamimi et al [46] | Subjective | Not clear; Assessment for accuracy, informativeness, and accessibility by clinical toxicologists and emergency medicine physicians |
| Baglivo et al [47] | Objective based on key answers and comparison with 5th-year medical students’ performance | Not applicable |
| Biswas et al [48] | Subjective by a 5-member team of optometry teaching and expert staff with over 100 years of clinical and academic experience between them; Independent evaluation on a 5-point Likert scale ranging from very poor to very good | The median scores across raters for each response were studied; The score represented rater consensus, while the score variance represented disagreements between the raters |
| Chen et al [49] | Objective based on key answers | Not applicable |
| Deiana et al [50] | Subjective based on qualitative assessment of correctness, clarity, and exhaustiveness; Each response rated using a 4-point Likert scale scored from strongly disagree to strongly agree | Independent assessment by 2 raters with experience in vaccination and health communication topics |
| Fuchs et al [51] | Objective based on key answers | Not applicable |
| Ghosh & Bir [52] | Objective based on key answers, with an element of subjectivity through the raters’ assessment | Scoring by 2 assessors on a scale of 0 to 5, with 0 being incorrect and 5 being fully correct, based on a preselected answer key |
| Giannos [53] | Objective based on key answers | Not applicable |
| Gobira et al [54] | Objective based on key answers, with an element of subjectivity through classifying the responses as adequate, inadequate, or indeterminate | Two raters independently scored the accuracy; After individual evaluations, the raters performed a third assessment to reach a consensus on the questions with differing results |
| Grewal et al [55] | Not clear | Not clear |
| Guerra et al [56] | Subjective through comparison with the results of a previous study on the average performance of users and of a cohort of medical students and neurosurgery residents | Not applicable |
| Hamed et al [57] | Subjective | Not clear |
| Hoch et al [58] | Objective based on key answers | Not applicable |
| Juhi et al [59] | Subjective, with Stockley’s Drug Interactions Pocket Companion 2015 used as a reference key | Two raters reached a consensus for categorizing the output |
| Kuang et al [60] | Subjective | Not clear |
| Kumari et al [61] | Subjective; Content validity checked by 2 experts in curriculum design | Three independent raters scored the content based on its correctness, with an accuracy score ranging from 1 to 5 |
| Kung et al [62] | Objective based on key answers | Not applicable |
| Lai et al [63] | Objective based on key answers | Not applicable |
| Lyu et al [64] | Subjective | Two experienced radiologists (with 21 and 8 years of experience) evaluated the quality of the ChatGPT responses |
| Moise et al [65] | Subjective through comparison with the latest American Academy of Otolaryngology–Head and Neck Surgery Foundation Clinical Practice Guideline: Tympanostomy Tubes in Children (Update) | Two independent raters evaluated the output; The interrater reliability was assessed using the Cohen κ test; To confirm consensus, responses were reviewed by the senior author |
| Oca et al [66] | Not clear | Not clear |
| Oztermeli & Oztermeli [67] | Objective based on key answers | Not applicable |
| Pugliese et al [68] | Subjective using the Likert scale for accuracy, completeness, and comprehensiveness | Multirater: 10 key opinion leaders in nonalcoholic fatty liver disease and 1 nonphysician with expertise in patient advocacy in liver disease independently rating the AI^a content |
| Sallam et al [69] | Subjective based on correctness, clarity, and conciseness | Fleiss multirater κ |
| Seth et al [70] | Subjective through comparison with current health care guidelines for rhinoplasty; Evaluation by a panel of plastic surgeons using a Likert scale to assess the readability and complexity of the text and the education level required for understanding, as well as the modified DISCERN^b score | Not clear |
| Suthar et al [71] | Subjective by 3 fellowship-trained neuroradiologists, using a 5-point Likert scale, with 1 indicating “highly improbable” and 5 indicating “highly probable” | Not applicable |
| Walker et al [72] | Modified EQIP^c tool, with comparison against the UK National Institute for Health and Care Excellence guidelines for gallstone disease, pancreatitis, liver cirrhosis, or portal hypertension, and the European Association for the Study of the Liver guidelines | All answers were assessed by 2 authors independently, and in case of a contradictory result, resolution was achieved by consensus; The process was repeated 3 times per EQIP^c item; Wrong or out-of-context answers, known as “AI hallucinations,” were recorded |
| Wang et al [73] | Subjective | Not clear |
| Wang et al [74] | Objective based on key answers | Not applicable |
| Zhou et al [75] | Subjective | Not clear |
^a AI: artificial intelligence.

^b DISCERN: an instrument for judging the quality of written consumer health information on treatment choices.

^c EQIP: Ensuring Quality Information for Patients.
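Several of the studies in the table report interrater agreement using Cohen κ (2 raters) or the Fleiss multirater κ. As a brief illustration of how these agreement statistics are computed, the following is a minimal pure-Python sketch; it is not drawn from any of the cited studies, and the example data are hypothetical.

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for 2 raters scoring the same items."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement, from each rater's marginal category frequencies.
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters placing item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumed constant across items
    k = len(counts[0])
    # Proportion of all assignments that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(k)]
    # Per-item agreement among all rater pairs.
    p_i = [(sum(c * c for c in row) - n_raters)
           / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 2 raters labelling 4 AI-generated responses
# as correct (1) or incorrect (0).
print(cohen_kappa([1, 1, 0, 1], [1, 1, 0, 0]))  # 0.5
```

Values near 1 indicate near-perfect agreement beyond chance, while values near 0 indicate agreement no better than chance; in practice, library implementations (eg, scikit-learn or statsmodels) would typically be used rather than hand-rolled code.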