Table 3.
Publication | LLM | Application domain | Topic | Questions number and type | Runs | Intervention | Rater | Grading | Outcome |
---|---|---|---|---|---|---|---|---|---|
Johnson et al. (2023)14 | GPT-3.5 | Medical knowledge; translation/summary | Various cancer entities | 13; OE | 5 | FAQs on cancer myths derived from an online patient forum, comparing AI-generated responses to the original source material. | 5 Exp. | Binary accuracy (yes/no); Readability (FKGL*) | LLM accuracy: 0.969; NCI accuracy: 1.0. LLMs consistently answered repetitive questions accurately. Both sources had lower readability levels than health literacy guidelines recommend. |
Schulte (2023)15 | GPT-3.5 | Medical knowledge | Metastatic solid tumours | 51; SATA; PE | 1 | Prompted the LLM to list therapies and compared the total recommended treatments to NCCN guidelines. | Author | “Valid Therapy Quotient” (VTQ) | VTQ of 0.77; 77% of named therapies aligned with guidelines. |
Coskun et al. (2023)16 | GPT-3.5 | Medical knowledge | Prostate cancer | 59; OE | NR | FAQs on prevention, aetiology, diagnosis, prognosis, and therapy from the official European Association of Urology (EAU) patient forum, compared to the reference source. | 2 Exp. | Multidimensional: Accuracy includes true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Similarity is measured by cosine similarity. | Precision: 0.426; Recall: 0.549; F1 score: 0.426; Cosine similarity: 0.609; Mean GQS: 3.62. |
Chen et al. (2023)17 | GPT-3.5 | Medical knowledge | Lung; prostate; breast cancer | 104; OE; PE | NR | Evaluated treatment prompts in four styles against NCCN guidelines. | 3 Exp. | Multidimensional: Uses a self-developed 5-item grading system to assess the number of recommended treatments and their concordance with guidelines. | LLM provided at least one NCCN-concordant treatment: 1.0; non-concordant treatment: 0.343; hallucinated responses: 0.125; interrater agreement: 0.62. |
Lombardo et al. (2024)18 | GPT-3.5 | Medical knowledge | Prostate cancer | 195; OE | NR | Prompts were drafted from GL recommendations on Classification; Diagnosis; Treatment; Follow-Up; QoL and compared to reference source; assessed on correctness | 2 Exp. | 4-point Likert-scale: 1—Completely correct; 2—Correct but inadequate; 3—Mix of correct and misleading; 4—Completely incorrect. | Completely correct: 0.26; Correct but inadequate: 0.26; Mix of correct and misleading: 0.24; Incorrect: 0.24; Best performance in follow-up and QoL; Worst performance in diagnosis and treatment |
Ozgor et al. (2024)19 | GPT-4 | Medical knowledge | Genitourinary cancer | 210; OE | NR | FAQs on diagnosis; treatment; aetiology; follow-up from various sources vs. prompts derived from GL | 3 Exp. | 5-point Likert-scale Quality; Global Quality Score (GQS); 5 = highest | GQS score of 5 for prostate cancer: 0.646; for bladder cancer: 0.629; for kidney cancer: 0.681; and for testicular cancer: 0.639. Mean GQS for answers to GL-derived questions was significantly lower than for answers to FAQs; performance on questions aligned with the EAU guideline was deemed unsatisfactory. |
Sorin et al. (2023)20 | GPT-3.5 | Medical knowledge | Breast cancer | 10; OE | NR | LLM was given a detailed patient history and prompted to recommend treatment. Recommendations of the LLM were compared to retrospective tumour board decisions. | 2 Exp. | Multidimensional: 5-point Likert-scale for agreement with TB; summarization and explanation. | Agreement with TB: 70%. Mean scores for the first reviewer were summarization: 3.7; recommendation: 4.3; and explanation: 4.6. Mean scores for the second reviewer were summarization: 4.3; recommendation: 4.0; and explanation: 4.3. The LLM showed a tendency in some cases to overlook important patient information. |
Lukac et al. (2023)21 | GPT-3.5 | Medical knowledge | Breast cancer | 10; OE | NR | LLM was given a detailed patient history and prompted to recommend treatment. Recommendations of the LLM were compared to retrospective tumour board decisions. | NR | Agreement (point-based scale 0–2) | 0.16 of outputs congruent with TB. The LLM provided mostly generalized answers; the current version is not able to provide specific recommendations for the therapy of patients with primary breast cancer. |
Gebrael et al. (2023)22 | GPT-4 | Medical knowledge | Metastatic prostate cancer | 56; OE | NR | LLM was given detailed histories of patients presenting to the emergency ward with metastatic prostate cancer; the LLM was prompted to decide whether to admit or discharge. | NR | Sensitivity and specificity of GPT-4 in determining whether a patient required admission or discharge. | LLM sensitivity in determining admission: 0.957; LLM specificity in discharging patients: 0.182. Findings suggest that GPT-4 has the potential to assist health providers in improving patient triage in the emergency setting. |
Holmes et al. (2023)23 | GPT-3.5; GPT-4; Bard; BLOOMZ | Medical knowledge | Radiation oncology | 100; MCQ | 5 | Authors tested different LLMs on radiation oncology exam bank questions with 5 different context and instruction templates; results were also compared with human performance. | 9 Exp.; 6 non-Exp. | Accuracy (number of correct responses) | GPT-4 outperformed all other LLMs and, on average, the medical physicists, with 67% correct answers. |
Haver et al. (2023)46 | GPT-3.5; GPT-4; Bard | Translation/summary | Lung cancer | 19; OE | 3 | Evaluated the use of three LLMs for simplifying LLM-generated responses to common questions about lung cancer and lung cancer screening (LCS). | 3 Exp. | Readability (FRE*; FKGL*) | GPT-3.5’s baseline responses to lung cancer and LCS questions were challenging to read. Simplified responses from all three LLMs (GPT-3.5, GPT-4, Bard) enhanced readability, with Bard showing the most improvement. However, the average readability of these simplified responses still exceeded an eighth-grade level, too complex for the average adult patient. |
Choo et al. (2024)24 | GPT-3.5 | Medical knowledge | Colorectal cancer | 30; OE | NR | LLM was given a detailed patient history and prompted to recommend treatment. Recommendations of the LLM were compared to retrospective tumour board decisions. | NR | 4-point Likert-scale; Concordance with TB | Results deemed satisfactory, with concordance between LLM and tumour board of 0.733; LLM recommendations did not match the TB in 0.13 of cases. |
Haemmerli et al. (2023)25 | GPT-3.5 | Medical knowledge | Brain cancer | 10; OE | NR | LLM was prompted with a detailed patient history to recommend treatment, which was then evaluated by a rater. Interrater agreement was also assessed. | 7 Exp. | 10-point Likert scale used to rate agreement with LLM recommendations; intraclass correlation coefficient (ICC) measured interrater agreement. | LLM median scores: diagnosis: 3; treatment: 7; therapy regimen: 6; overall agreement: 5. Performance was poor for classifying glioma types but good for recommending adjuvant treatments. Overall, there was moderate expert agreement, with an ICC of 0.7. |
Griewing et al. (2023)26 | GPT-3.5 | Medical knowledge | Breast cancer | 20; OE | NR | LLM was provided a detailed patient history and prompted to recommend treatments, which were then compared to retrospective tumour board decisions. The cases were designed to showcase the pathological and immune morphological diversity of primary breast cancer. | 13 Exp. | Number of treatment recommendations and concordance with TB | LLM proposed 61 treatment recommendations compared to 48 by experts, with the largest discrepancy in genetic testing. Overall concordance between LLM and experts was 0.5. LLM was deemed inadequate as a support tool for tumour boards. |
Benary et al. (2023)27 | GPT-3.5; Perplexity; BioMedLM; Galactica | Medical knowledge | Various cancer entities | 10; OE | NR | Cases of advanced cancer with genetic alterations were submitted to four LLMs and one expert physician for personalized treatment identification. The concordance of LLM-generated treatments with the human reference was evaluated. | 1 Exp. | Categories: true positive (TP), false positive (FP), true negative (TN), false negative (FN); likelihood of a treatment option originating from an LLM rated on a Likert-scale from 0 to 10 | LLMs proposed a median of 4 treatment recommendations, with F1 scores of 0.04, 0.17, 0.14, and 0.19 across all patients. LLMs failed to match the quality and credibility of human experts. |
Davis et al. (2023)28 | GPT-3.5 | Medical knowledge; summary/translation | Oropharyngeal | 15; OE | NR | LLM outputs assessed for accuracy; comprehensiveness; and similarity; readability also assessed. Authors developed a new score; responses graded below an average of 3 were commented on by raters. | 4 Exp. | Multidimensional: 5-point Likert-scales for Accuracy; Comprehensiveness; Similarity; Readability (FRE*; FKGL*) | LLM responses were suboptimal, with average accuracy: 3.88; comprehensiveness: 3.80; and similarity: 3.67. FRE and FKGL scores both indicated a reading level above the 6th-grade level recommended for patients. Physician comments: suboptimal educational value and potential to misinform. |
Atarere et al. (2024)29 | GPT-3.5; YouChat; Copilot | Medical knowledge | Colorectal | 20; OE | 3 | 5 questions on important colorectal cancer screening concepts and 5 common questions asked by patients about diagnosis and treatment; LLM outputs compared to GL as a reference. | 2 Exp. | Binary for 2 dimensions: appropriateness (yes/no); reliability (yes/no) | Reliably appropriate responses for screening: GPT-3.5 and YouChat: 1.0; Copilot: 0.867. Reliably appropriate responses for common questions: GPT-3.5: 0.8; YouChat and Copilot: 0.6. |
Rahsepar et al. (2023)45 | GPT-3.5; Bard; search engines | Medical knowledge | Lung | 40; OE | 3 | Questions on prevention, screening, and commonly used terminology, with the Lung Imaging Reporting and Data System (Lung-RADS) GL as reference, were presented to the LLMs as well as to the Bing and Google search engines as controls. Answers were reviewed for accuracy and consistency between runs. | 2 Exp. | Accuracy on 4-point Likert-scale; Consistency (agreement between 3 runs) | GPT-3.5 responses were satisfactory with accuracy score 4: 0.708; Bard responses were suboptimal with accuracy score 4: 0.517; Bing responses were suboptimal with accuracy score 4: 0.617; Google responses were suboptimal with accuracy score 4: 0.55. GPT-3.5 and Google were most consistent; no tool answered all questions correctly and with full consistency. |
Musheyev et al. (2024)30 | GPT-3.5; Perplexity; ChatSonic; Copilot | Medical knowledge; summary/translation | Genitourinary | 8; OE | NR | Top five search queries related to prostate; bladder; kidney; and testicular cancers according to Google Trends were prompted to the LLMs and evaluated for quality; understandability; actionability; misinformation; and readability using validated published instruments. | NR | Multidimensional: Quality (DISCERN*); Understandability and Actionability (PEMAT-P*); Misinformation (5-point Likert-scale); Readability (FKGL*) | LLM responses had moderate to high information quality (median DISCERN score 4 out of 5; range 2–5) and lacked misinformation. Understandability was moderate (PEMAT-P understandability 66.7%; range 44.4–90.9%) and actionability was moderate to poor. |
Pan et al. (2023)31 | GPT-3.5; Perplexity; ChatSonic; Copilot | Medical knowledge; summary/translation | Skin; lung; breast; colorectal; prostate | 20; OE | NR | Top five search queries according to Google Trends were prompted to the LLMs and evaluated for quality; understandability; actionability; misinformation; and readability using validated published instruments. | 2 Exp. | Multidimensional: Quality (DISCERN*); Understandability and Actionability (PEMAT-P*); Misinformation with GL as reference (5-point Likert-scale); Readability (FKGL*) | LLMs performed satisfactorily with median DISCERN score: 5; median PEMAT-P understandability score: 0.667; median PEMAT-P actionability score: 0.2; and no misinformation. Responses are not readily actionable and are written at too complex a level for patients. |
Huang et al. (2023)32 | GPT-3.5; GPT-4 | Medical knowledge | Radiation oncology | 293; MCQ | NR | American College of Radiology (ACR) radiation oncology exam questions and Grey Zone cases were used to benchmark the performance of GPT-4. | 1 Exp. | Multidimensional: Correctness; Comprehensiveness (4-point Likert-scale); Novel aspects not mentioned by experts; hallucinations (“present” vs. “not present”). | GPT-4 outperformed GPT-3.5 with average accuracy: 0.788 vs. 0.621. Limitations deemed due to risk of hallucinations. |
Nguyen et al. (2023)33 | GPT-4; Bard | Medical knowledge | Various cancer entities | OE; SATA; PE | NR | Questions about cancer screening strategies were prompted in OE and SATA structure; authors compared outputs generated with and without added context. | 2 Stud. | Accuracy for open-ended prompts (score range 0–2) and select-all-that-apply prompts (score range 0–1) | GPT-4 and Bard average score for open-ended prompts: 0.83 and 0.7. GPT-4 and Bard average score for select-all-that-apply prompts: 0.85 and 0.82. PE enhanced LLM outputs for OE prompts but did not improve SATA responses. |
Iannantuono et al. (2024)44 | GPT-3.5; GPT-4; Bard | Medical knowledge; summary/translation | Immuno-oncology | 60; OE | 3 | Questions covering 4 domains of immuno-oncology (mechanisms; indications; toxicities; and prognosis) were evaluated. | 2 Exp. | Accuracy (point-based scale 1–3); Relevance (point-based scale 1–3); Readability (point-based scale 1–3) | Proportion of questions answered: GPT-3.5 and GPT-4: 1.0; Bard: 0.53. Google Bard demonstrated relatively poor performance. Risk of inaccuracy or incompleteness was evident in all 3 LLMs, highlighting the importance of expert-driven verification. |
Liang et al. (2024)34 | GPT-3.5; GPT-4; GPT-3.5 Turbo | Medical knowledge | Genitourinary | 80; OE | 3 | Questions from urology experts were posed three times to both GPT-3.5 and GPT-4; afterwards, GPT-3.5 Turbo was iteratively fine-tuned on the same question set and the training outcomes were assessed. | NR | Binary accuracy (yes/no) | GPT-3.5 average accuracy: 67.08%; GPT-4 average accuracy: 77.50%. Both GPT-3.5 and GPT-4 were subject to instability in answering; fine-tuned GPT-3.5 Turbo stabilized average accuracy at 93.75%, and with a second iteration GPT-3.5 Turbo achieved 100% accuracy. |
Marchi et al. (2024)35 | GPT-3.5 | Medical knowledge | Oropharyngeal | 68; OE | 2 | Questions on treatment; adjuvant treatment; and follow-up compared to GL (NCCN) as reference. Evaluated for sensitivity; specificity; and F1 score | 2 Exp. | Binary accuracy (yes/no) | Overall sensitivity: 100%; Overall accuracy: 92%; Overall F1-score: 0.93; Overall precision: 91.7%. |
Yeo et al. (2023)36 | GPT-3.5 | Medical knowledge; patient empowerment | Liver | 164; OE | 2 | Questions regarding knowledge; management; and emotional support for cirrhosis and HCC, assessed for accuracy and emotional support capacity. | 2 Exp. | Accuracy (4-point Likert scale) | LLM responses were satisfactory with an accuracy of 0.74; the LLM performed best in basic knowledge; lifestyle; and treatment. The LLM encouraged patients to follow treatment strategies, offered emotional support, and recommended sources such as support groups in a structured manner. |
Hermann et al. (2023)37 | GPT-3.5 | Medical knowledge | Cervical cancer | 64; OE | NR | Questions on prevention; diagnosis; treatment and QoL were drafted from official patient forum websites and the authors’ clinical experiences and evaluated for correctness and comprehensiveness. | 2 Exp. | Accuracy (4-point Likert scale) | LLM responses were satisfactory with correct and comprehensive: 0.531; correct but not comprehensive: 0.297; partially incorrect: 0.156; completely incorrect: 0.016. LLM performed best in “Prevention/QoL” and worst in “Diagnosis”. |
Lechien et al. (2024)38 | GPT-4 | Medical knowledge | Oropharyngeal | 20; OE | 2 | Detailed histories of patients with head and neck cancer were evaluated for additional examinations; management; and therapeutic approaches and compared to the reference (TB decision). | 2 Exp. | Multidimensional: AIPI Tool* | GPT-4 was accurate in 13 cases (65%). Mean number of treatment recommendations proposed by LLM: 5.15; mean number proposed by tumour board: 4. Test–retest showed mostly consistent LLM outputs. |
Kuşcu et al. (2023)39 | GPT-4 | Medical knowledge | Oropharyngeal | 154; OE | 2 | Questions from various sources: official patient forum; institutions; patient support groups; and social media. Topics: basic knowledge; diagnosis; treatment; recovery; operative risks; complications; follow-up; and cancer prevention. | 2 Exp. | Accuracy (4-point Likert scale); Reproducibility (number of similar responses) | LLM responses were satisfactory with an accuracy of 0.863; reproducibility of 0.941 upon test–retest evaluation. |
Chung et al. (2023)47 | GPT-3.5 | Translation/summary | Prostate | 5; OE | 3 | Prompted to summarize five full MRI reports and evaluated for readability. Radiation oncologists were asked to evaluate the AI-summarized reports via an anonymous questionnaire. | 12 Exp. | Accuracy (Likert-scale 1–5); Readability (FKGL*) | LLM was able to simplify full MRI reports to at or below a sixth-grade reading level (9.6 vs. 5.0); median word count was reduced from 464 (full MRI) to 182 (LLM). Summaries were deemed appropriate for patients. |
Choi et al. (2024)40 | GPT-3.5 | Medical knowledge | Kidney cancer | 10; OE | NR | FAQs were drafted and evaluated for service quality; the survey was distributed via email to 103 urologists and 24 urological oncologists. | 103 Exp. | Service quality (SERVQUAL*) | Mean positive evaluation rate: 0.779; positive scores for overall understandability: 0.542; LLM could not replace explanations provided by experts: 0.708. |
Dennstädt et al. (2024)41 | GPT-3.5 | Medical knowledge | Radiation oncology | 70 MCQ; 25 OE | NR | Multiple-choice and open-ended questions about clinical; physics; and biology general knowledge, evaluated for correctness and usefulness. | 6 Exp. | Accuracy (5-point Likert scale); Usefulness (5-point Likert scale) | LLM valid responses in multiple-choice questions: 0.943; LLM very good responses in open-ended questions: 0.293. |
Wei et al. (2024)42 | GPT-4 | Medical knowledge; summary/translation | Oropharyngeal | 49; OE | NR | Commonly asked questions about head and neck cancer were obtained and input into both GPT-4 and the Google search engine. | 2 Exp. | Quality (5-point Likert scale using EQIP Tool*); Readability (FRE*; FKGL*) | Google sources received significantly higher quality scores than the LLM (4.2 vs. 3.6); no significant difference between the average reading ease score for Google and LLM (37.0 vs. 33.1) or the average grade level score for Google and LLM (14.2 vs. 14.3). |
Lee et al. (2023)43 | GPT-3.5 | Medical knowledge; summary/translation | Oropharyngeal | NR | NR | Common surgical questions were generated from presurgical educational information, including indications; risks; and recovery. Evaluated for thoroughness; inaccuracy; and readability and compared to a search engine. | 5 Exp. | Accuracy; Thoroughness (10-point Likert scale); Inaccuracy (number of errors); Readability (FRE*; FKGL*) | LLM and Google showed similar accuracy (mean 7.7 vs. 8.1) and thoroughness (mean 7.5 vs. 7.3), with few medical errors (mean 1.2 vs. 1.0). Readability was comparable between both tools. Experts preferred Google 52% of the time. |
OE open-ended, SATA select-all-that-apply, n.a. not applicable, c.i. clinical information, Exp expert, Rv reviewer, NR not reported, PE prompt engineering, GL guideline, MCQ multiple-choice question, NCI National Cancer Institute, TB tumour board, QoL quality of life, FRE Flesch reading ease, FKGL Flesch–Kincaid grade level, * validated assessment tools; DISCERN, SERVQUAL, PEMAT-P, EQIP and AIPI are validated tools.
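Several of the studies in Table 3 (e.g., Coskun et al.16, Benary et al.27, Marchi et al.35) score LLM outputs against a guideline or expert reference using precision, recall, F1 and cosine similarity. The following minimal sketch illustrates how such metrics can be computed from a list of LLM-proposed treatments and a reference list; the treatment names, the bag-of-words tokenization, and the function names (`precision_recall_f1`, `cosine_similarity`) are illustrative assumptions and are not taken from any of the cited studies.

```python
from collections import Counter
import math


def precision_recall_f1(llm_items, reference_items):
    """Set-based precision, recall and F1 between LLM-proposed and reference treatments."""
    llm, ref = set(llm_items), set(reference_items)
    tp = len(llm & ref)  # proposed and present in the reference
    fp = len(llm - ref)  # proposed but absent from the reference
    fn = len(ref - llm)  # present in the reference but not proposed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two answer texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical example: treatments named by an LLM vs. a guideline reference.
    llm_answer = ["abiraterone", "docetaxel", "radium-223"]
    guideline = ["abiraterone", "docetaxel", "enzalutamide", "cabazitaxel"]
    print(precision_recall_f1(llm_answer, guideline))  # approx. (0.667, 0.5, 0.571)
    print(cosine_similarity("androgen deprivation therapy plus docetaxel",
                            "docetaxel with androgen deprivation therapy"))
```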
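Many of the grading schemes above also report readability via the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). The sketch below applies the standard FRE and FKGL formulas to a sample answer; the vowel-group syllable counter is a crude approximation, and the function names and example text are hypothetical rather than drawn from the reviewed publications.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: number of contiguous vowel groups (a crude approximation)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str):
    """Flesch Reading Ease and Flesch-Kincaid Grade Level from the standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(1, len(sentences))  # words per sentence
    spw = syllables / max(1, len(words))       # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl


if __name__ == "__main__":
    answer = ("Prostate cancer screening uses a blood test. "
              "Your doctor can explain the benefits and the risks.")
    fre, fkgl = readability(answer)
    print(f"FRE: {fre:.1f}, FKGL: {fkgl:.1f}")
```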