2024 Feb 15;13:e54704. doi: 10.2196/54704

Table 2.

Classification based on the evaluation approach of the content generated by the artificial intelligence–based models.

Authors | Evaluation of performance | Individual role and interrater reliability
Al-Ashwal et al [42] | Objective, via 2 different clinical reference tools | Not applicable
Alfertshofer et al [43] | Objective, based on the key answers, with the questions screened independently by 4 investigators | Not applicable
Ali et al [44] | Objective for multiple-choice and true-or-false questions; subjective for short-answer and essay questions | Assessment by 2 assessors independently, with the intraclass correlation coefficient used to measure agreement
Aljindan et al [45] | Objective, based on key answers and historical performance metrics | Not applicable
Altamimi et al [46] | Subjective | Not clear; assessment for accuracy, informativeness, and accessibility by clinical toxicologists and emergency medicine physicians
Baglivo et al [47] | Objective, based on key answers and comparison with 5th-year medical students' performance | Not applicable
Biswas et al [48] | Subjective, by a 5-member team of optometry teaching and expert staff with over 100 years of combined clinical and academic experience; independent evaluation on a 5-point Likert scale ranging from very poor to very good | The median scores across raters for each response were studied; the median score represented rater consensus, while the score variance represented disagreement between the raters
Chen et al [49] | Objective, based on key answers | Not applicable
Deiana et al [50] | Subjective, based on qualitative assessment of correctness, clarity, and exhaustiveness; each response rated on a 4-point Likert scale from strongly disagree to strongly agree | Independent assessment by 2 raters with experience in vaccination and health communication topics
Fuchs et al [51] | Objective, based on key answers | Not applicable
Ghosh & Bir [52] | Objective, based on key answers, with subjectivity introduced by the raters' assessment | Scoring by 2 assessors on a scale of 0 to 5 (0=incorrect, 5=fully correct), based on a preselected answer key
Giannos [53] | Objective, based on key answers | Not applicable
Gobira et al [54] | Objective, based on key answers, with an element of subjectivity through classifying responses as adequate, inadequate, or indeterminate | Two raters independently scored accuracy; after the individual evaluations, the raters performed a third assessment to reach consensus on questions with differing results
Grewal et al [55] | Not clear | Not clear
Guerra et al [56] | Subjective, through comparison with the results of a previous study on the average performance of users and a cohort of medical students and neurosurgery residents | Not applicable
Hamed et al [57] | Subjective | Not clear
Hoch et al [58] | Objective, based on key answers | Not applicable
Juhi et al [59] | Subjective, using Stockley's Drug Interactions Pocket Companion 2015 as a reference key | Two raters reached a consensus for categorizing the output
Kuang et al [60] | Subjective | Not clear
Kumari et al [61] | Subjective; content validity checked by 2 experts in curriculum design | Three independent raters scored content for correctness, with an accuracy score ranging from 1 to 5
Kung et al [62] | Objective, based on key answers | Not applicable
Lai et al [63] | Objective, based on key answers | Not applicable
Lyu et al [64] | Subjective | Two experienced radiologists (with 21 and 8 years of experience) evaluated the quality of the ChatGPT responses
Moise et al [65] | Subjective, through comparison with the latest American Academy of Otolaryngology–Head and Neck Surgery Foundation Clinical Practice Guideline: Tympanostomy Tubes in Children (Update) | Two independent raters evaluated the output; interrater reliability was assessed using the Cohen κ test; to confirm consensus, responses were reviewed by the senior author
Oca et al [66] | Not clear | Not clear
Oztermeli & Oztermeli [67] | Objective, based on key answers | Not applicable
Pugliese et al [68] | Subjective, using a Likert scale for accuracy, completeness, and comprehensiveness | Multirater: 10 key opinion leaders in nonalcoholic fatty liver disease and 1 nonphysician with expertise in patient advocacy in liver disease independently rated the AI^a content
Sallam et al [69] | Subjective, based on correctness, clarity, and conciseness | Fleiss multirater κ
Seth et al [70] | Subjective, through comparison with current health care guidelines for rhinoplasty, plus evaluation by a panel of plastic surgeons using a Likert scale to assess readability, text complexity, and the education level required for understanding, and the modified DISCERN^b score | Not clear
Suthar et al [71] | Subjective, by 3 fellowship-trained neuroradiologists using a 5-point Likert scale, with 1 indicating "highly improbable" and 5 indicating "highly probable" | Not applicable
Walker et al [72] | Modified EQIP^c Tool, with comparison against the UK National Institute for Health and Care Excellence guidelines for gallstone disease, pancreatitis, liver cirrhosis, or portal hypertension and the European Association for the Study of the Liver guidelines | All answers were assessed by 2 authors independently, with contradictory results resolved by consensus; the process was repeated 3 times per EQIP item; wrong or out-of-context answers, known as "AI hallucinations," were recorded
Wang et al [73] | Subjective | Not clear
Wang et al [74] | Objective, based on key answers | Not applicable
Zhou et al [75] | Subjective | Not clear

^a AI: artificial intelligence.

^b DISCERN: an instrument for judging the quality of written consumer health information on treatment choices.

^c EQIP: Ensuring Quality Information for Patients.
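Several of the studies above report chance-corrected agreement statistics (the Cohen κ test, the Fleiss multirater κ, or the intraclass correlation coefficient) when 2 or more raters independently scored the AI output. As a minimal illustration of what such a statistic captures, the sketch below computes Cohen's κ for two raters from scratch; the rating lists are hypothetical and are not taken from any of the cited studies.

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    proportion of agreement and p_e is the agreement expected by
    chance from each rater's marginal label frequencies.
    """
    assert len(rater1) == len(rater2) and rater1, "need paired, nonempty ratings"
    n = len(rater1)
    labels = set(rater1) | set(rater2)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: product of marginal frequencies, summed over labels.
    p_e = sum((rater1.count(l) / n) * (rater2.count(l) / n) for l in labels)

    if p_e == 1.0:  # both raters use a single identical label; kappa undefined
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: 2 raters classifying 10 AI responses as
# correct (1) or incorrect (0). They agree on 8 of 10 items.
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
r2 = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
print(round(cohen_kappa(r1, r2), 3))  # moderate agreement, below the raw 0.8
```

Note that κ is deliberately lower than the raw 80% agreement because both raters would agree on some items by chance alone; this correction is why κ (rather than percentage agreement) is the statistic reported in the studies above.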