Table 1. Oncologist Scoring of LLM Chatbot Treatment Recommendations<sup>a</sup>
Scoring criteria | Agreement across 4 unique prompts, No. (%) (n = 26)<sup>b</sup> | All prompts, No. (%) (n = 104) | Cancer type: breast cancer, No. (%) (n = 20) | Cancer type: lung cancer, No. (%) (n = 20)<sup>c</sup> | Cancer type: non–small cell lung cancer, No. (%) (n = 20) | Cancer type: small cell lung cancer, No. (%) (n = 12) | Cancer type: prostate cancer, No. (%) (n = 32) | Extent of disease: not specified, No. (%) (n = 20) | Extent of disease: localized, No. (%) (n = 64) | Extent of disease: advanced, No. (%) (n = 20) |
---|---|---|---|---|---|---|---|---|---|---|
1. How many treatment recommendations were provided? | 22 (84.6) | | | | | | | | | |
0 | | 3 (2.9) | 1 (5.0) | 1 (5.0) | 0 | 0 | 1 (3.1) | 0 | 3 (4.7) | 0 |
1 | | 2 (1.9) | 0 | 0 | 0 | 0 | 2 (6.3) | 0 | 2 (3.1) | 0 |
>1 | | 99 (95.2) | 19 (95.0) | 19 (95.0) | 20 (100) | 12 (100) | 29 (90.6) | 20 (100) | 59 (92.2) | 20 (100) |
2. How many of the recommended treatments were in accordance with 2021 NCCN guidelines?<sup>d</sup> | 9 (34.6) | | | | | | | | | |
0 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Some but not all | | 35 (33.7) | 2 (10.0) | 6 (30.0) | 8 (40.0) | 7 (58.3) | 12 (37.5) | 2 (10.0) | 26 (40.6) | 7 (35.0) |
All | | 66 (63.5) | 17 (85.0) | 13 (65.0) | 12 (60.0) | 5 (41.7) | 19 (59.4) | 18 (90.0) | 35 (54.7) | 13 (65.0) |
NA | | 3 (2.9) | 1 (5.0) | 1 (5.0) | 0 | 0 | 1 (3.1) | 0 | 3 (4.7) | 0 |
3. If some but not all to question 2, how many recommendations were correct in their entirety per 2021 NCCN guidelines? | 11 (42.3) | | | | | | | | | |
0 | | 1 (0.9) | 0 | 0 | 0 | 1 (8.3) | 0 | 0 | 1 (1.6) | 0 |
≥1 | | 33 (31.7) | 1 (5.0) | 7 (35.0) | 8 (40.0) | 6 (50.0) | 11 (34.4) | 2 (10.0) | 25 (39.1) | 6 (30.0) |
NA<sup>d</sup> | | 70 (67.3) | 19 (95.0) | 13 (65.0) | 12 (60.0) | 5 (41.7) | 21 (65.6) | 18 (90.0) | 38 (59.4) | 14 (70.0) |
4. How many recommended treatments were hallucinated? | 16 (61.5) | | | | | | | | | |
0 | | 91 (87.5) | 19 (95.0) | 18 (90.0) | 19 (95.0) | 9 (75.0) | 26 (81.3) | 19 (95.0) | 55 (85.9) | 17 (85.0) |
≥1 | | 13 (12.5) | 1 (5.0) | 2 (10.0) | 1 (5.0) | 3 (25.0) | 6 (18.8) | 1 (5.0) | 9 (14.1) | 3 (15.0) |
5. If ≥1 to question 4, how many of the hallucinated treatments are now a recommended treatment in the most current NCCN guidelines? | 16 (61.5) | | | | | | | | | |
0 | | 13 (12.5) | 1 (5.0) | 2 (10.0) | 1 (5.0) | 3 (25.0) | 6 (18.8) | 1 (5.0) | 9 (14.1) | 3 (15.0) |
≥1 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | | 91 (87.5) | 19 (95.0) | 18 (90.0) | 19 (95.0) | 9 (75.0) | 26 (81.3) | 19 (95.0) | 55 (85.9) | 17 (85.0) |

Abbreviations: LLM, large language model; NA, not applicable; NCCN, National Comprehensive Cancer Network.
<sup>a</sup> Data reported as No. (%) using majority rule of annotators’ scores.
<sup>b</sup> Diagnosis descriptions for which the outputs of all 4 prompts for a given diagnosis description received the same score from the rating oncologists.
<sup>c</sup> Lung cancer was queried separately from non–small cell lung cancer and small cell lung cancer.
<sup>d</sup> Slight misalignment of categorical scores between questions 2 and 3 resulted from the majority rule. For example, for question 3, NA = 70 instead of 69 (66 + 3) because of majority voting.
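
The misalignment described in the last footnote can be illustrated with a minimal sketch. The number of raters (three), the example votes, and the `fallback` tie-break are all hypothetical assumptions made for illustration; the table does not state how many oncologists scored each output or how ties were resolved. Only the counts 66, 3, and 70 come from the "All prompts" column.

```python
from collections import Counter

def majority(votes, fallback=None):
    """Return the label chosen by more than half of the raters, else `fallback`."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback

# Hypothetical scores from three raters for a single chatbot output.
# Each rater is internally consistent: question 3 is NA unless that
# rater answered "Some but not all" to question 2.
q2_votes = ["Some but not all", "All", "NA"]  # no strict majority
q3_votes = ["≥1", "NA", "NA"]                 # NA wins 2 of 3

# Assumed tie-break for question 2 (not taken from the source).
q2 = majority(q2_votes, fallback="Some but not all")
q3 = majority(q3_votes)

# Voting per question independently can label the same output
# "Some but not all" on question 2 yet NA on question 3.
print(q2, "/", q3)

# Counts from the "All prompts" column (n = 104).
q2_all, q2_na, q3_na = 66, 3, 70
extra_na = q3_na - (q2_all + q2_na)  # outputs that are NA on question 3 only
print(extra_na, round(100 * q3_na / 104, 1))
```

Under these assumptions, one output whose question 2 votes are split while its question 3 votes favor NA is enough to account for the difference of one between 70 and 69.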